
Semantic Parsing via Staged Query Graph Generation:

Question Answering with Knowledge Base

Wen-tau Yih   Ming-Wei Chang   Xiaodong He   Jianfeng Gao


Microsoft Research
Redmond, WA 98052, USA
{scottyih,minchang,xiaohe,jfgao}@microsoft.com

Abstract

We propose a novel semantic parsing framework for question answering using a knowledge base. We define a query graph that resembles subgraphs of the knowledge base and can be directly mapped to a logical form. Semantic parsing is reduced to query graph generation, formulated as a staged search problem. Unlike traditional approaches, our method leverages the knowledge base in an early stage to prune the search space and thus simplifies the semantic matching problem. By applying an advanced entity linking system and a deep convolutional neural network model that matches questions and predicate sequences, our system outperforms previous methods substantially, and achieves an F1 measure of 52.5% on the WEBQUESTIONS dataset.

1 Introduction

Organizing the world's facts and storing them in a structured database, large-scale knowledge bases (KB) like DBPedia (Auer et al., 2007) and Freebase (Bollacker et al., 2008) have become important resources for supporting open-domain question answering (QA). Most state-of-the-art approaches to KB-QA are based on semantic parsing, where a question (utterance) is mapped to its formal meaning representation (e.g., logical form) and then translated to a KB query. The answers to the question can then be retrieved simply by executing the query. The semantic parse also provides a deeper understanding of the question, which can be used to justify the answer to users, as well as to provide easily interpretable information to developers for error analysis.

However, most traditional approaches for semantic parsing are largely decoupled from the knowledge base, and thus are faced with several challenges when adapted to applications like QA. For instance, a generic meaning representation may have the ontology matching problem when the logical form uses predicates that differ from those defined in the KB (Kwiatkowski et al., 2013). Even when the representation language is closely related to the knowledge base schema, finding the correct predicates from the large vocabulary in the KB to relations described in the utterance remains a difficult problem (Berant and Liang, 2014).

Inspired by (Yao and Van Durme, 2014; Bao et al., 2014), we propose a semantic parsing framework that leverages the knowledge base more tightly when forming the parse for an input question. We first define a query graph that can be straightforwardly mapped to a logical form in λ-calculus and is semantically closely related to λ-DCS (Liang, 2013). Semantic parsing is then reduced to query graph generation, formulated as a search problem with staged states and actions. Each state is a candidate parse in the query graph representation and each action defines a way to grow the graph. The representation power of the semantic parse is thus controlled by the set of legitimate actions applicable to each state. In particular, we stage the actions into three main steps: locating the topic entity in the question, finding the main relationship between the answer and the topic entity, and expanding the query graph with additional constraints that describe properties the answer needs to have, or relationships between the answer and other entities in the question.

One key advantage of this staged design is that through grounding partially the utterance to some entities and predicates in the KB, we make the search far more efficient by focusing on the promising areas in the space that most likely lead to the correct query graph, before the full parse is determined.

For example, after linking "Family Guy" in the question "Who first voiced Meg on Family Guy?" to FamilyGuy (the TV show) in the knowledge base, the procedure needs only to examine the predicates that can be applied to FamilyGuy instead of all the predicates in the KB. Resolving other entities also becomes easy, as given the context, it is clear that Meg refers to MegGriffin (the character in Family Guy). Our design divides this particular semantic parsing problem into several sub-problems, such as entity linking and relation matching. With this integrated framework, best solutions to each sub-problem can be easily combined and help produce the correct semantic parse. For instance, an advanced entity linking system that we employ outputs candidate entities for each question with both high precision and recall. In addition, by leveraging a recently developed semantic matching framework based on convolutional networks, we present better relation matching models using continuous-space representations instead of pure lexical matching. Our semantic parsing approach improves the state-of-the-art result on the WEBQUESTIONS dataset (Berant et al., 2013) to 52.5% in F1, a 7.2% absolute gain compared to the best existing method.

The rest of this paper is structured as follows. Sec. 2 introduces the basic notion of the graph knowledge base and the design of our query graph. Sec. 3 presents our search-based approach for generating the query graph. The experimental results are shown in Sec. 4, and the discussion of our approach and the comparisons to related work are in Sec. 5. Finally, Sec. 6 concludes the paper.

2 Background

In this work, we aim to learn a semantic parser that maps a natural language question to a logical form query q, which can be executed against a knowledge base K to retrieve the answers. Our approach takes a graphical view of both K and q, and reduces semantic parsing to mapping questions to query graphs. We describe the basic design below.

2.1 Knowledge base

The knowledge base K considered in this work is a collection of subject-predicate-object triples (e1, p, e2), where e1, e2 ∈ E are the entities (e.g., FamilyGuy or MegGriffin) and p ∈ P is a binary predicate like character. A knowledge base in this form is often called a knowledge graph because of its straightforward graphical representation – each entity is a node and two related entities are linked by a directed edge labeled by the predicate, from the subject to the object entity.

Figure 1: Freebase subgraph of Family Guy (the entity FamilyGuy is connected through CVT nodes to MegGriffin, MilaKunis, LaceyChabert, and the dates 12/26/1999 and 1/31/1999).

To compare our approach to existing methods, we use Freebase, which is a large database with more than 46 million topics and 2.6 billion facts. In Freebase's design, there is a special entity category called compound value type (CVT), which is not a real-world entity, but is used to collect multiple fields of an event or a special relationship.

Fig. 1 shows a small subgraph of Freebase related to the TV show Family Guy. Nodes are the entities, including some dates and special CVT entities¹. A directed edge describes the relation between two entities, labeled by the predicate.

¹ In the rest of the paper, we use the term entity for both real-world and CVT entities, as well as properties like date or height. The distinction is not essential to our approach.

2.2 Query graph

Given the knowledge graph, executing a logical-form query is equivalent to finding a subgraph that can be mapped to the query and then resolving the binding of the variables. To capture this intuition, we describe a restricted subset of λ-calculus in a graph representation as our query graph.

Our query graph consists of four types of nodes: grounded entity (rounded rectangle), existential variable (circle), lambda variable (shaded circle), and aggregation function (diamond). Grounded entities are existing entities in the knowledge base K. Existential variables and lambda variables are ungrounded entities. In particular, we would like to retrieve all the entities that can map to the lambda variables in the end as the answers. The aggregation function is designed to operate on a specific entity, which typically captures some numerical properties. Just like in the knowledge graph, related nodes in the query graph are connected by directed edges, labeled with predicates in K.

Figure 2: Query graph that represents the question "Who first voiced Meg on Family Guy?" (the graph FamilyGuy → y → x with edges labeled cast and actor, plus the branches y → MegGriffin labeled character and y → arg min).

To demonstrate this design, Fig. 2 shows one possible query graph for the question "Who first voiced Meg on Family Guy?" using Freebase. The two entities, MegGriffin and FamilyGuy, are represented by two rounded rectangle nodes. The circle node y means that there should exist an entity describing some casting relations like the character, the actor and the time she started the role². The shaded circle node x is also called the answer node, and is used to map entities retrieved by the query. The diamond node arg min constrains that the answer needs to be the earliest actor for this role. Equivalently, the logical form query in λ-calculus without the aggregation function is: λx.∃y.cast(FamilyGuy, y) ∧ actor(y, x) ∧ character(y, MegGriffin).

Running this query graph against K as in Fig. 1 will match both LaceyChabert and MilaKunis before applying the aggregation function, but only LaceyChabert is the correct answer as she started this role earlier (by checking the from property of the grounded CVT node).

² y should be grounded to a CVT entity in this case.
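As a rough illustration of how directly such a graph executes, the following sketch (reusing the toy triple store K and the objects() helper from the sketch in Sec. 2.1; a toy illustration rather than the system's implementation) evaluates the Fig. 2 query graph by joining along the core chain, applying the MegGriffin constraint, and then taking the arg min over the from property.

```python
# Sketch: evaluate lambda x. exists y. cast(FamilyGuy, y) ^ actor(y, x)
#         ^ character(y, MegGriffin), then apply arg min over 'from'.
def answer_fig2():
    candidates = []
    for y in objects("FamilyGuy", "cast"):             # cast(FamilyGuy, y)
        if "MegGriffin" in objects(y, "character"):    # character(y, MegGriffin)
            for start in objects(y, "from"):           # date the role started
                for x in objects(y, "actor"):          # actor(y, x)
                    candidates.append((start, x))
    if not candidates:
        return set()
    earliest = min(start for start, _ in candidates)   # the arg min node
    return {x for start, x in candidates if start == earliest}

print(answer_fig2())   # {'LaceyChabert'}
```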

Our query graph design is inspired by (Reddy et al., 2014), but with some key differences. The nodes and edges in our query graph closely resemble the exact entities and predicates from the knowledge base. As a result, the graph can be straightforwardly translated to a logical form query that is directly executable. In contrast, the query graph in (Reddy et al., 2014) is mapped from the CCG parse of the question, and needs further transformations before mapping to subgraphs of the target knowledge base to retrieve answers. Semantically, our query graph is more related to simple λ-DCS (Berant et al., 2013; Liang, 2013), which is a syntactic simplification of λ-calculus when applied to graph databases. A query graph can be viewed as the tree-like graph pattern of a logical form in λ-DCS. For instance, the path from the answer node to an entity node can be described using a series of join operations in λ-DCS. Different paths of the tree graph are combined via the intersection operators.

3 Staged Query Graph Generation

We focus on generating query graphs with the following properties. First, the tree graph consists of one entity node as the root, referred to as the topic entity. Second, there exists only one lambda variable x as the answer node, with a directed path from the root to it and zero or more existential variables in-between. We call this path the core inferential chain of the graph, as it describes the main relationship between the answer and the topic entity. Variables can only occur in this chain, and the chain only has variable nodes except the root. Finally, zero or more entity or aggregation nodes can be attached to each variable node, including the answer node. These branches are the additional constraints that the answers need to satisfy. For example, in Fig. 2, FamilyGuy is the root and FamilyGuy → y → x is the core inferential chain. The branch y → MegGriffin specifies the character, and y → arg min constrains that the answer needs to be the earliest actor for this role.

Given a question, we formalize the query graph generation process as a search problem, with staged states and actions. Let S = {φ, Se, Sp, Sc} be the set of states, where each state could be an empty graph (φ), a single-node graph with the topic entity (Se), a core inferential chain (Sp), or a more complex query graph with additional constraints (Sc). Let A = {Ae, Ap, Ac, Aa} be the set of actions. An action grows a given graph by adding some edges and nodes. In particular, Ae picks an entity node; Ap determines the core inferential chain; Ac and Aa add constraints and aggregation nodes, respectively. Given a state, the valid action set can be defined by the finite state diagram in Fig. 3. Notice that the order of possible actions is chosen for the convenience of implementation. In principle, we could choose a different order, such as matching the core inferential chain first and then resolving the topic entity linking. However, since we will consider multiple hypotheses during search, the order of the staged actions can simply be viewed as a different way to prune the search space or to bias the exploration order.

Figure 3: The legitimate actions to grow a query graph (φ → Se → Sp → Sc via the actions Ae, Ap and Aa/Ac, with Aa/Ac also applicable to Sc itself). See text for detail.

We define the reward function on the state space using a log-linear model. The reward basically estimates the likelihood that a query graph correctly parses the question. Search is done using the best-first strategy with a priority queue, which is formally defined in Appendix A. In the following subsections, we use a running example of finding the semantic parse of the question qex = "Who first voiced Meg on Family Guy?" to describe the sequence of actions.

3.1 Linking Topic Entity

Starting from the initial state s0, the valid actions are to create a single-node graph that corresponds to the topic entity found in the given question. For instance, possible topic entities in qex can either be FamilyGuy or MegGriffin, as shown in Fig. 4.

Figure 4: Two possible topic entity linking actions applied to an empty graph, for the question "Who first voiced [Meg] on [Family Guy]?" (s0: the empty graph φ; s1: FamilyGuy; s2: MegGriffin).

We use an entity linking system that is designed for short and noisy text (Yang and Chang, 2015). For each entity e in the knowledge base, the system first prepares a surface-form lexicon that lists all possible ways that e can be mentioned in text. This lexicon is created using various data sources, such as names and aliases of the entities, the anchor text in Web documents and the Wikipedia redirect table. Given a question, it considers all the consecutive word sequences that have occurred in the lexicon as possible mentions, paired with their possible entities. Each pair is then scored by a statistical model based on its frequency counts in the surface-form lexicon. To tolerate potential mistakes of the entity linking system, as well as to explore more possible query graphs, up to 10 top-ranked entities are considered as the topic entity. The linking score will also be used as a feature for the reward function.

3.2 Identifying Core Inferential Chain

Given a state s that corresponds to a single-node graph with the topic entity e, the valid actions to extend this graph are to identify the core inferential chain; namely, the relationship between the topic entity and the answer. For example, Fig. 5 shows three possible chains that expand the single-node graph in s1. Because the topic entity e is given, we only need to explore legitimate predicate sequences that can start from e. Specifically, to restrict the search space, we explore all paths of length 2 when the middle existential variable can be grounded to a CVT node, and paths of length 1 if not. We also consider longer predicate sequences if the combinations are observed in the training data³.

Figure 5: Candidate core inferential chains starting from the entity FamilyGuy (s3: FamilyGuy → y → x with edges cast and actor; s4: FamilyGuy → y → x with edges writer and start; s5: FamilyGuy → x with edge genre).

Analogous to the entity linking problem, where the goal is to find the mapping of mentions to entities in K, identifying the core inferential chain is to map the natural utterance of the question to the correct predicate sequence. For the question "Who first voiced Meg on [Family Guy]?" we need to measure the likelihood that each of the sequences in {cast-actor, writer-start, genre} correctly captures the relationship between Family Guy and Who. We reduce this problem to measuring semantic similarity using neural networks.

³ Decomposing relations in the utterance can be done using decoding methods (e.g., (Bao et al., 2014)). However, similar to ontology mismatch, the relation in text may not have a corresponding single predicate; for example, grandparent needs to be mapped to parent-parent in Freebase.
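As a hedged sketch of the chain enumeration described in Sec. 3.2 (again over the toy triple store K and objects() helper from Sec. 2.1, with a simplistic is_cvt() test standing in for Freebase's explicit CVT typing), candidate chains can be gathered as length-1 predicate paths plus length-2 paths whose middle node is a CVT:

```python
# Sketch: enumerate candidate core inferential chains from a topic entity.
def is_cvt(node):
    # Toy stand-in: real Freebase marks CVT entities by their type.
    return node.startswith("cvt")

def candidate_chains(topic_entity):
    chains = set()
    for (s1, p1, o1) in K:
        if s1 != topic_entity:
            continue
        if not is_cvt(o1):
            chains.add((p1,))                 # length-1 path, e.g. ('genre',)
        else:
            for (s2, p2, o2) in K:
                if s2 == o1:
                    chains.add((p1, p2))      # length-2 path through a CVT node
    return chains

print(candidate_chains("FamilyGuy"))
# contains ('genre',), ('cast', 'actor'), ('cast', 'character'), ('cast', 'from')
```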

Figure 6: The architecture of the convolutional neural networks (CNN) used in this work. The CNN model maps a variable-length word sequence (e.g., a pattern or predicate sequence) to a low-dimensional vector in a latent semantic space. See text for the description of each layer. (From bottom to top: word sequence xt; word hashing layer ft of 15K letter-trigram features per word, via the word hashing matrix Wf; convolutional layer ht of 1000 units, via the convolution matrix Wc; max pooling layer v of 300 units; semantic layer y of 300 units, via the semantic projection matrix Ws.)

3.2.1 Deep Convolutional Neural Networks

To handle the huge variety of semantically equivalent ways of stating the same question, as well as the mismatch between natural language utterances and predicates in the knowledge base, we propose using Siamese neural networks (Bromley et al., 1993) for identifying the core inferential chain. For instance, one of our constructions maps the question to a pattern by replacing the entity mention with a generic symbol <e> and then compares it with a candidate chain, such as "who first voiced meg on <e>" vs. cast-actor. The model consists of two neural networks, one for the pattern and the other for the inferential chain. Both are mapped to k-dimensional vectors as the output of the networks. Their semantic similarity is then computed using some distance function, such as cosine. This continuous-space representation approach has been proposed recently for semantic parsing and question answering (Bordes et al., 2014a; Yih et al., 2014) and has shown better results compared to lexical matching approaches (e.g., word-alignment models). In this work, we adapt a convolutional neural network (CNN) framework (Shen et al., 2014b; Shen et al., 2014a; Gao et al., 2014) to this matching problem. The network architecture is illustrated in Fig. 6.

The CNN model first applies a word hashing technique (Huang et al., 2013) that breaks a word into a vector of letter-trigrams (xt → ft in Fig. 6). For example, the bag of letter-trigrams of the word "who" is {#-w-h, w-h-o, h-o-#} after adding the word boundary symbol #. Then, it uses a convolutional layer to project the letter-trigram vectors of words within a context window of 3 words to a local contextual feature vector (ft → ht), followed by a max pooling layer that extracts the most salient local features to form a fixed-length global feature vector (v). The global feature vector is then fed to feed-forward neural network layers to output the final non-linear semantic features (y), as the vector representation of either the pattern or the inferential chain.

Training the model needs positive pairs, such as a pattern like "who first voiced meg on <e>" and an inferential chain like cast-actor. These pairs can be extracted from the full semantic parses when provided in the training data. If the correct semantic parses are latent and only the pairs of questions and answers are available, such as the case in the WEBQUESTIONS dataset, we can still hypothesize possible inferential chains by traversing the paths in the knowledge base that connect the topic entity and the answer. Sec. 4.1 will illustrate this data generation process in detail.

Our model has two advantages over the embedding approach (Bordes et al., 2014a). First, the word hashing layer helps control the dimensionality of the input space and can easily scale to a large vocabulary. The letter-trigrams also capture some sub-word semantics (e.g., words with minor typos have almost identical letter-trigram vectors), which makes it especially suitable for questions from real-world users, such as those issued to a search engine. Second, it uses a deeper architecture with convolution and max-pooling layers, which has more representation power.
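To make the word hashing step concrete, here is a minimal sketch; the small trigram vocabulary passed to hash_word() is illustrative only, and all downstream convolution, pooling and projection weights of the actual model are learned and not shown.

```python
from collections import Counter

def letter_trigrams(word):
    """Break a word into letter-trigrams after adding the boundary symbol '#'."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def hash_word(word, trigram_vocab):
    """Bag-of-letter-trigrams count vector over a fixed trigram vocabulary."""
    counts = Counter(letter_trigrams(word))
    return [counts[t] for t in trigram_vocab]

print(letter_trigrams("who"))                          # ['#wh', 'who', 'ho#']
print(hash_word("who", ["#wh", "who", "ho#", "meg"]))  # [1, 1, 1, 0]
```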

Figure 7: Extending an inferential chain with constraints and aggregation functions (s3: the chain FamilyGuy → y → x with edges cast and actor; s6: s3 with the constraint branch y → MegGriffin; s7: s6 with an arg min node attached to y).

3.3 Augmenting Constraints & Aggregations

A graph with just the inferential chain forms the simplest legitimate query graph and can be executed against the knowledge base K to retrieve the answers; namely, all the entities that x can be grounded to. For instance, the graph in s3 in Fig. 7 will retrieve all the actors who have been on FamilyGuy. Although this set of entities obviously contains the correct answer to the question (assuming the topic entity FamilyGuy is correct), it also includes incorrect entities that do not satisfy additional constraints implicitly or explicitly mentioned in the question.

To further restrict the set of answer entities, the graph with only the core inferential chain can be expanded by two types of actions: Ac and Aa. Ac is the set of possible ways to attach an entity to a variable node, where the edge denotes one of the valid predicates that can link the variable to the entity. For instance, in Fig. 7, s6 is created by attaching MegGriffin to y with the predicate character. This is equivalent to the last conjunctive term in the corresponding λ-expression: λx.∃y.cast(FamilyGuy, y) ∧ actor(y, x) ∧ character(y, MegGriffin). Sometimes, the constraints are described over the entire answer set through the aggregation function, such as the word "first" in our example question qex. This is handled similarly by the actions Aa, which attach an aggregation node to a variable node. For example, the arg min node of s7 in Fig. 7 chooses the grounding with the smallest from attribute of y.

The full possible constraint set can be derived by first issuing the core inferential chain as a query to the knowledge base to find the bindings of the variables y's and x, and then enumerating all neighboring nodes of these entities. This, however, often results in an unnecessarily large constraint pool. In this work, we employ simple rules to retain only the nodes that have some possibility of being legitimate constraints. For instance, a constraint node can be an entity that also appears in the question (detected by our entity linking component), or an aggregation constraint can only be added if certain keywords like "first" or "latest" occur in the question. The complete set of these rules can be found in Appendix B.

3.4 Learning the reward function

Given a state s, the reward function γ(s) basically judges whether the query graph represented by s is the correct semantic parse of the input question q. We use a log-linear model to learn the reward function. Below we describe the features and the learning process.

3.4.1 Features

The features we designed essentially match specific portions of the graph to the question, and generally correspond to the staged actions described previously, including:

Topic Entity. The score returned by the entity linking system is directly used as a feature.

Core Inferential Chain. We use the similarity scores of the different CNN models described in Sec. 3.2.1 to measure the quality of the core inferential chain. PatChain compares the pattern (replacing the topic entity with an entity symbol) and the predicate sequence. QuesEP concatenates the canonical name of the topic entity and the predicate sequence, and compares it with the question. This feature conceptually tries to verify the entity linking suggestion. These two CNN models are learned using pairs of the question and the inferential chain of the parse in the training data. In addition to the in-domain similarity features, we also train a ClueWeb model using the Freebase annotation of ClueWeb corpora (Gabrilovich et al., 2013). For two entities in a sentence that can be linked by one or two predicates, we pair the sentences and predicates to form a parallel corpus to train the CNN model.

Constraints & Aggregations. When a constraint node is present in the graph, we use some simple features to check whether there are words in the question that can be associated with the constraint entity or property. Examples of such features include whether a mention in the question can be linked to this entity, and the percentage of the words in the name of the constraint entity that appear in the question. Similarly, we check the existence of some keywords in a pre-compiled list, such as "first", "current" or "latest", as features for aggregation nodes such as arg min. The complete list of these simple word matching features can also be found in Appendix B.

Overall. The number of answer entities retrieved when issuing the query to the knowledge base and the number of nodes in the query graph are both included as features.
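As an illustrative sketch of the log-linear reward, the score of a candidate graph can be written as a weighted sum of its feature values; the feature names below mirror those in Fig. 8, while the weights are placeholders rather than the learned values.

```python
def reward(features, weights):
    """Log-linear reward: a weighted sum of the query-graph features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Feature values of one candidate query graph (cf. Fig. 8); placeholder weights.
features_s = {
    "EntityLinkingScore": 0.9, "PatChain": 0.7, "QuesEP": 0.6, "ClueWeb": 0.2,
    "ConstraintEntityWord": 0.5, "ConstraintEntityInQ": 1.0,
    "AggregationKeyword": 1.0, "NumNodes": 5.0, "NumAns": 1.0,
}
weights = {name: 0.1 for name in features_s}   # illustrative, not learned
print(round(reward(features_s, weights), 2))   # 1.09
```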

Figure 8: Active features of a query graph s for the question q = "Who first voiced Meg on Family Guy?" (the graph FamilyGuy → y → x with edges cast and actor, plus the branches y → MegGriffin and y → arg min): (1) EntityLinkingScore(FamilyGuy, "Family Guy") = 0.9; (2) PatChain("who first voiced meg on <e>", cast-actor) = 0.7; (3) QuesEP(q, "family guy cast-actor") = 0.6; (4) ClueWeb("who first voiced meg on <e>", cast-actor) = 0.2; (5) ConstraintEntityWord("Meg Griffin", q) = 0.5; (6) ConstraintEntityInQ("Meg Griffin", q) = 1; (7) AggregationKeyword(argmin, q) = 1; (8) NumNodes(s) = 5; (9) NumAns(s) = 1. Feature (1) is the entity linking score of the topic entity. (2)-(4) are different model scores of the core chain. (5) indicates that 50% of the words in "Meg Griffin" appear in the question q. (6) is 1 when the mention "Meg" in q is correctly linked to MegGriffin by the entity linking component. (8) is the number of nodes in s. The knowledge base returns only 1 entity when issuing this query, so (9) is 1.

To illustrate our feature design, Fig. 8 presents the active features of an example query graph.

3.4.2 Learning

In principle, once the features are extracted, the model can be trained using any standard off-the-shelf learning algorithm. Instead of treating it as a binary classification problem, where only the correct query graphs are labeled as positive, we view it as a ranking problem. Suppose we have several candidate query graphs for each question⁴. Let ga and gb be the query graphs described in states sa and sb for the same question q, and let the entity sets Aa and Ab be those retrieved by executing ga and gb, respectively. Suppose that A is the labeled answers to q. We first compute the precision, recall and F1 score of Aa and Ab, compared with the gold answer set A. We then rank sa and sb by their F1 scores⁵. The intuition behind this is that even if a query is not completely correct, it is still preferred over totally incorrect queries. In this work, we use a one-layer neural network model based on lambda-rank (Burges, 2010) for training the ranker.

4 Experiments

We first introduce the dataset and evaluation metric, followed by the main experimental results and some analysis.

4.1 Data & evaluation metric

We use the WEBQUESTIONS dataset (Berant et al., 2013), which consists of 5,810 question/answer pairs. These questions were collected using the Google Suggest API and the answers were obtained from Freebase with the help of Amazon MTurk. The questions are split into training and testing sets, which contain 3,778 questions (65%) and 2,032 questions (35%), respectively. This dataset has several unique properties that make it appealing, and it was used in several recent papers on semantic parsing and question answering. For instance, although the questions are not directly sampled from search query logs, the selection process was still biased towards commonly asked questions on a search engine. The distribution of this question set is thus closer to the "real" information need of search users than that of a small number of human editors. The system performance is basically measured by the ratio of questions that are answered correctly. Because there can be more than one answer to a question, precision, recall and F1 are computed based on the system output for each individual question. The average F1 score is reported as the main evaluation metric⁶.

Because this dataset contains only question and answer pairs, we use essentially the same search procedure to simulate the semantic parses for training the CNN models and the overall reward function. Candidate topic entities are first generated using the same entity linking system for each question in the training data. Paths on the Freebase knowledge graph that connect a candidate entity to at least one answer entity are identified as the core inferential chains⁷. If an inferential-chain query returns more entities than the correct answers, we explore adding constraint and aggregation nodes, until the entities retrieved by the query graph are identical to the labeled answers, or the F1 score cannot be increased further. Negative examples are sampled from the incorrect candidate graphs generated during the search process.

⁴ We will discuss how to create these candidate query graphs from question/answer pairs in Sec. 4.1.
⁵ We use F1 partially because it is the evaluation metric used in the experiments.
⁶ We used the official evaluation script from http://www-nlp.stanford.edu/software/sempre/.
⁷ We restrict the path length to 2. In principle, parses of shorter chains can be used to train the initial reward function, for exploring longer paths using the same search procedure.
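For concreteness, a simplified sketch of the per-question evaluation metric described in Sec. 4.1 follows; the official SEMPRE evaluation script referenced in footnote 6 handles additional edge cases.

```python
def prf1(predicted, gold):
    """Per-question precision, recall and F1 of the predicted answer set."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    tp = len(predicted & gold)
    prec, rec = tp / len(predicted), tp / len(gold)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def average_f1(predictions, gold_answers):
    """Average F1 over all questions: the main evaluation metric."""
    return sum(prf1(p, g)[2] for p, g in zip(predictions, gold_answers)) / len(gold_answers)

print(average_f1([{"LaceyChabert"}, {"MilaKunis"}],
                 [{"LaceyChabert"}, {"LaceyChabert"}]))   # 0.5
```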

Method                       Prec.   Rec.    F1
(Berant et al., 2013)        48.0    41.3    35.7
(Bordes et al., 2014b)       -       -       29.7
(Yao and Van Durme, 2014)    -       -       33.0
(Berant and Liang, 2014)     40.5    46.6    39.9
(Bao et al., 2014)           -       -       37.5
(Bordes et al., 2014a)       -       -       39.2
(Yang et al., 2014)          -       -       41.3
(Wang et al., 2014)          -       -       45.3
Our approach – STAGG         52.8    60.7    52.5

Table 1: The results of our approach compared to existing work. The numbers of other systems are either from the original papers or derived from the evaluation script, when the output is available.

Method         #Entities   # Covered Ques.   # Labeled Ent.
Freebase API   19,485      3,734 (98.8%)     3,069 (81.2%)
Ours           9,147       3,770 (99.8%)     3,318 (87.8%)

Table 2: Statistics of entity linking results on training set questions. Both methods cover roughly the same number of questions, but the Freebase API suggests twice the number of entities output by our entity linking system and covers fewer topic entities labeled in the original data.

In the end, we produce 17,277 query graphs with non-zero F1 scores from the training set questions and about 1.7M completely incorrect ones.

For training the CNN models to identify the core inferential chain (Sec. 3.2.1), we only use 4,058 chain-only query graphs that achieve F1 = 0.5 to form the parallel question and predicate sequence pairs. The hyper-parameters of the CNN, such as the learning rate and the numbers of hidden nodes at the convolutional and semantic layers, were chosen via cross-validation. We reserved 684 pairs of patterns and inference-chains from the whole training examples as the held-out set, and used the rest as the initial training set. The optimal hyper-parameters were determined by the performance of models trained on the initial training set when applied to the held-out data. We then fixed the hyper-parameters and retrained the CNN models using the whole training set. The performance of the CNN is insensitive to the hyper-parameters as long as they are in a reasonable range (e.g., 1000 ± 200 nodes in the convolutional layer, 300 ± 100 nodes in the semantic layer, and learning rate 0.05 ∼ 0.005), and the training process often converges after ∼ 800 epochs.

When training the reward function, we created up to 4,000 examples for each question that contain all the positive query graphs and randomly selected negative examples. The model is trained as a ranker, where example query graphs are ranked by their F1 scores.

4.2 Results

Tab. 1 shows the results of our system, STAGG (Staged query graph generation), compared to existing work⁸. As can be seen from the table, our system outperforms the previous state-of-the-art method by a large margin – a 7.2% absolute gain.

Given the staged design of our approach, it is thus interesting to examine the contributions of each component. Because topic entity linking is the very first stage, the quality of the entities found in the questions, both in precision and recall, affects the final results significantly. To get some insight into how our topic entity linking component performs, we also experimented with applying the Freebase Search API to suggest entities for possible mentions in a question. As can be observed in Tab. 2, to cover most of the training questions, we only need half the number of suggestions when using our entity linking component, compared to the Freebase API. Moreover, they also cover more entities that were selected as the topic entities in the original dataset. Starting from those 9,147 entities output by our component, answers of 3,453 questions (91.4%) can be found in their neighboring nodes. When replacing our entity linking component with the results from the Freebase API, we also observed a significant performance degradation. The overall system performance drops from 52.5% to 48.4% in F1 (Prec = 49.8%, Rec = 55.7%), which is 4.1 points lower.

Next we test the system performance when the query graph has just the core inferential chain. Tab. 3 summarizes the results. When only the PatChain CNN model is used, the performance is already very strong, outperforming all existing work. Adding the other CNN models boosts the performance further, reaching 51.8%, which is only slightly lower than the full system performance. This may be due to two reasons. First, the questions from search engine users are often short and a large portion of them simply ask about properties of an entity.

⁸ We do not include results of (Reddy et al., 2014) because they used only a subset of 570 test questions, which are not directly comparable to results from other work. On these 570 questions, our system achieves 67.0% in F1.

Method      Prec.   Rec.    F1
PatChain    48.8    59.3    49.6
+QuesEP     50.7    60.6    50.9
+ClueWeb    51.3    62.6    51.8

Table 3: The system results when only the inferential-chain query graphs are generated. We started with the PatChain CNN model and then added QuesEP and ClueWeb sequentially. See Sec. 3.4 for the description of these models.

Examining the query graphs generated for the training set questions, we found that 1,888 (50.0%) can be answered exactly (i.e., F1 = 1) using a chain-only query graph. Second, even if the correct parse requires more constraints, the less constrained graph still gets a partial score, as its results cover the correct answers.

4.3 Error Analysis

Although our approach substantially outperforms existing methods, the room for improvement seems big. After all, the accuracy for the intended application, question answering, is still low and only slightly above 50%. We randomly sampled 100 questions for which our system did not generate the completely correct query graphs, and categorized the errors. About one third of the errors are in fact due to label issues and are not real mistakes. This includes label errors (2%), incomplete labels (17%, e.g., only one song is labeled as the answer to "What songs did Bob Dylan write?") and acceptable answers (15%, e.g., "Time in China" vs. "UTC+8"). 8% of the errors are due to incorrect entity linking; however, sometimes the mention is inherently ambiguous (e.g., AFL in "Who founded the AFL?" could mean either "American Football League" or "American Federation of Labor"). 35% of the errors are because of incorrect inferential chains; 23% are due to incorrect or missing constraints.

5 Related Work and Discussion

Several semantic parsing methods use a domain-independent meaning representation derived from combinatory categorial grammar (CCG) parses (e.g., (Cai and Yates, 2013; Kwiatkowski et al., 2013; Reddy et al., 2014)). In contrast, our query graph design matches closely the graph knowledge base. Although not fully demonstrated in this paper, the query graph can in fact be fairly expressive. For instance, negations can be handled by adding tags to the constraint nodes indicating that certain conditions cannot be satisfied. Our graph generation method is inspired by (Yao and Van Durme, 2014; Bao et al., 2014). Unlike traditional semantic parsing approaches, it uses the knowledge base to help prune the search space when forming the parse. Similar ideas have also been explored in (Poon, 2013).

Empirically, our results suggest that it is crucial to identify the core inferential chain, which matches the relationship between the topic entity in the question and the answer. Our CNN models can be seen as analogous to the embedding approaches (Bordes et al., 2014a; Yang et al., 2014), but are more sophisticated. By allowing parameter sharing among different question-pattern and KB predicate pairs, the matching score of a rare or even unseen pair in the training data can still be predicted precisely. This is due to the fact that the prediction is based on the shared model parameters (i.e., projection matrices) that are estimated using all training pairs.

6 Conclusion

In this paper, we present a semantic parsing framework for question answering using a knowledge base. We define a query graph as the meaning representation that can be directly mapped to a logical form. Semantic parsing is reduced to query graph generation, formulated as a staged search problem. With the help of an advanced entity linking system and a deep convolutional neural network model that matches questions and predicate sequences, our system outperforms previous methods substantially on the WEBQUESTIONS dataset. In the future, we would like to extend our query graph to represent more complicated questions, and explore more features and models for matching constraints and aggregation functions. Applying other structured-output prediction methods to graph generation will also be investigated.

Acknowledgments

We thank the anonymous reviewers for their thoughtful comments, Ming Zhou, Nan Duan and Xuchen Yao for sharing their experience on the question answering problem studied in this work, and Chris Meek for his valuable suggestions. We are also grateful to Siva Reddy and Jonathan Berant for providing us additional data.

Appendix

See supplementary notes.
References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.

Junwei Bao, Nan Duan, Ming Zhou, and Tiejun Zhao. 2014. Knowledge-based question answering as machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 967–976, Baltimore, Maryland, June. Association for Computational Linguistics.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Baltimore, Maryland, June. Association for Computational Linguistics.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October. Association for Computational Linguistics.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250, New York, NY, USA. ACM.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014a. Question answering with subgraph embeddings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 615–620, Doha, Qatar, October. Association for Computational Linguistics.

Antoine Bordes, Jason Weston, and Nicolas Usunier. 2014b. Open question answering with weakly supervised embedding models. In Proceedings of ECML-PKDD.

Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. 1993. Signature verification using a "Siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):669–688.

Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23–581.

Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 423–433, Sofia, Bulgaria, August. Association for Computational Linguistics.

Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya. 2013. FACC1: Freebase annotation of ClueWeb corpora, version 1. Technical report, June.

Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, Li Deng, and Yelong Shen. 2014. Modeling interestingness with deep neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for Web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pages 2333–2338. ACM.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1545–1556, Seattle, Washington, USA, October. Association for Computational Linguistics.

Percy Liang. 2013. Lambda dependency-based compositional semantics. Technical report, arXiv.

Hoifung Poon. 2013. Grounded unsupervised semantic parsing. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 933–943.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2:377–392.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014a. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 101–110. ACM.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014b. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web, pages 373–374.

Zhenghao Wang, Shengquan Yan, Huaming Wang, and Xuedong Huang. 2014. An overview of Microsoft Deep QA system on Stanford WebQuestions benchmark. Technical Report MSR-TR-2014-121, Microsoft, September.

Yi Yang and Ming-Wei Chang. 2015. S-MART: Novel tree-based structured learning algorithms applied to tweet entity linking. In Annual Meeting of the Association for Computational Linguistics (ACL).

Min-Chul Yang, Nan Duan, Ming Zhou, and Hae-Chang Rim. 2014. Joint relational embeddings for knowledge-based question answering. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 645–650, Doha, Qatar, October. Association for Computational Linguistics.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 956–966, Baltimore, Maryland, June. Association for Computational Linguistics.

Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 643–648, Baltimore, Maryland, June. Association for Computational Linguistics.
