Acquiring syntactic and semantic transformations in question answering
Abstract
One and the same fact can be expressed in natural language in many different ways, using different words and/or different syntax. This phenomenon, commonly called paraphrasing, is the main reason why Natural Language Processing (NLP) is such a challenging task. It becomes especially apparent in Question Answering (QA), where the task is to automatically answer a question posed in natural language, usually by searching a collection that itself consists of natural language texts. It cannot be assumed that a sentence answering a question uses the same words as the question, or that it combines those words according to the same syntactic rules.
In this thesis we describe methods that help to address this problem. Firstly, we explore how lexical resources, namely FrameNet, PropBank and VerbNet, can be used to recognize the wide range of syntactic realizations that a sentence answering a given question can have. We find that our methods based on these resources work well for web-based Question Answering. However, we identify two problems: 1) all three resources still have significant coverage gaps; 2) these resources are not suited to identifying answer sentences that provide only indirect evidence. While the first problem currently hinders performance, it is not a theoretical problem that renders the approach unsuitable; rather, it shows that more effort must be invested in producing more complete resources. The second problem is more persistent. Many valid answer sentences, especially in small, journalistic corpora, do not provide direct evidence for a question; they strongly suggest an answer without logically implying it. Semantically motivated resources like FrameNet, PropBank and VerbNet cannot easily be employed to recognize such forms of indirect evidence.
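
To make the role of these resources concrete, the following sketch queries all three for the verb "buy" via NLTK's corpus readers. This is purely illustrative and not the system described in the thesis; the verb is chosen arbitrarily. The point is that each resource groups together the different realizations of a predicate: FrameNet via frames and their frame elements, PropBank via numbered argument rolesets, and VerbNet via verb classes that share syntactic alternations.

    # Illustrative only: queries FrameNet, PropBank and VerbNet through
    # NLTK's corpus readers for the verb "buy".
    import nltk

    for pkg in ("framenet_v17", "propbank", "verbnet"):
        nltk.download(pkg, quiet=True)

    from nltk.corpus import framenet as fn
    from nltk.corpus import propbank as pb
    from nltk.corpus import verbnet as vn

    # FrameNet: the frame evoked by "buy" and its frame elements (roles).
    for lu in fn.lus(r'^buy\.v$'):
        print(lu.frame.name, sorted(lu.frame.FE.keys()))

    # PropBank: the numbered argument structure of the roleset buy.01.
    roleset = pb.roleset('buy.01')
    for role in roleset.findall('roles/role'):
        print('ARG' + role.attrib['n'], role.attrib['descr'])

    # VerbNet: the verb classes "buy" belongs to; members of a class
    # share the same set of syntactic alternations.
    print(vn.classids('buy'))
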
In order to investigate ways of dealing with indirect evidence, we used Amazon’s Mechanical Turk to collect over 8,000 manually identified answer sentences from the AQUAINT corpus for the more than 1,900 TREC questions from the 2002 to 2006 QA tracks. These answer sentences, paired with their corresponding questions, form the QASP corpus, which we released to the public in April 2008. In this dissertation, we use the QASP corpus to develop an approach to QA based on matching dependency relations between answer candidates and question constituents in the answer sentences. By acquiring knowledge about syntactic and semantic transformations from dependency relations in the QASP corpus, additional answer candidates can be identified that could not be linked to the question with our first approach.
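
As an illustration of what matching dependency relations involves, consider the following minimal sketch. It uses spaCy for parsing, which is not the parser used in the thesis, and the helper functions are hypothetical; the underlying idea is that an answer candidate and a question constituent occurring in the same sentence are linked by a path of dependency relations, and such paths can be collected as transformation patterns.

    # A minimal sketch, not the thesis system: extract the dependency
    # paths linking an answer candidate and a question constituent.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def path_to_root(token):
        """Chain of (dependency label, head lemma) pairs from a token
        up to the root of its sentence."""
        path = []
        while token.head is not token:   # the root is its own head
            path.append((token.dep_, token.head.lemma_))
            token = token.head
        return path

    def relation_between(doc, candidate, constituent):
        """Dependency paths anchoring an answer candidate and a question
        constituent within the same parsed sentence (by surface match)."""
        cand = [t for t in doc if t.text == candidate]
        cons = [t for t in doc if t.text == constituent]
        if not cand or not cons:
            return None
        return (path_to_root(cand[0]), path_to_root(cons[0]))

    # Question: "Who founded Microsoft?"  Answer candidate: "Gates"
    doc = nlp("Microsoft was founded by Bill Gates in 1975.")
    print(relation_between(doc, "Gates", "Microsoft"))

A passive answer sentence like the one above yields a different pair of paths than its active paraphrase "Bill Gates founded Microsoft in 1975"; learning that both path configurations realize the same question-answer relation is an example of the kind of transformation knowledge acquired from the QASP corpus.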