Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
There is a rich history of the use of synthetic tasks in machine learning, from the XOR problem which helped motivate neural networks (Minsky & Papert, 1969; Rumelhart et al., 1985), to circle, spiral and ring datasets that helped motivate some of the most well-known clustering and semi-supervised learning algorithms (Ng et al., 2002; Zhu et al., 2003), Mackey-Glass equations for time series (Müller et al., 1997), and so on – in fact some of the well known UCI datasets (Bache & Lichman, 2013) are synthetic as well (e.g., waveform). Recent work continues this trend. For example, in the area of developing learning algorithms with a memory component, synthetic datasets were used to help develop both the Neural Turing Machine of Graves et al. (2014) and the Memory Networks of Weston et al. (2014).

Our tasks are built with a unified underlying simulation of a physical world, akin to a classic text adventure game (Montfort, 2005), whereby actors move around manipulating objects and interacting with each other. As the simulation runs, grounded text and question answer pairs are simultaneously generated. Our goal is to categorize different kinds of questions into skill sets, which become our tasks. Our hope is that the analysis of performance on these tasks will help expose weaknesses of current models and help motivate new algorithm designs that alleviate these weaknesses. We further envision this as a feedback loop where new tasks can then be designed in response, perhaps in an adversarial fashion, in order to break the new models.
The tasks we design are detailed in Section 3, and the simulation used to generate them in Section 4. In Section 6 we give benchmark results of standard methods on our tasks, and analyse their successes and failures. In order to exemplify the kind of feedback loop between algorithm development and task development we envision, in Section 5 we propose a set of improvements to the recent Memory Network method, which has been shown to give promising performance in QA. We show that our proposed approach does indeed give improved performance on some tasks, but is still unable to solve some of them, which we consider as open problems.

2. Related Work

Several projects targeting language understanding using QA-based strategies have recently emerged. Unlike tasks like dialogue or summarization, QA is easy to evaluate (especially in true/false or multiple choice scenarios) and is hence an appealing research avenue. The difficulty lies in the definition of questions: they must be unambiguously answerable by adult humans (or children), but still require some thinking. The Allen Institute for AI's flagship project ARISTO¹ is organized around a collection of QA tasks derived from increasingly difficult science exams, at the 4th, 8th, and 12th grade levels. Richardson et al. (2013) proposed MCTest², a set of 660 stories and associated questions intended for research on the machine comprehension of text. Each question requires the reader to understand different aspects of the story.

¹ http://allenai.org/aristo.html
² http://research.microsoft.com/mct

These two initiatives go in a promising direction, but interpreting the results on these benchmarks remains complicated. Indeed, no system has yet been able to fully solve the proposed tasks, and since many sub-tasks need to be solved to answer any of their questions (coreference, deduction, use of common-sense, etc.), it is difficult to clearly identify the capabilities and limitations of these systems and hence to propose improvements and modifications. As a result, conclusions drawn from these projects are not much clearer than those coming from more traditional work on QA over large-scale Knowledge Bases (Berant et al., 2013; Fader et al., 2014). Besides, the best performing systems are based on hand-crafted patterns and features, and/or statistics acquired on very large corpora. It is difficult to argue that such systems actually understand language and are not simply light upgrades of traditional information extraction methods (Yao et al., 2014). The system of Berant et al. (2014) is more evolved, since it builds a structured representation of a text and of a question to answer. Despite its potential, this method remains highly domain specific and relies on a lot of prior knowledge.

Based on these observations, we chose to conceive a collection of much simpler QA tasks, with the main objective that failure or success of a system on any of them can unequivocally provide feedback on its capabilities. In that, we are close to the Winograd Schema Challenge (Levesque et al., 2011), which is organized around simple statements followed by a single binary choice question such as: "Joan made sure to thank Susan for all the help she had received. Who had received the help? Joan or Susan?". In this challenge, as in our tasks, it is straightforward to interpret results. Yet, where the Winograd Challenge is mostly centered around evaluating whether systems can acquire and make use of background knowledge that is not expressed in the words of the statement, our tasks are self-contained and more diverse. By self-contained we mean that our tasks come with both training data and evaluation data, rather than just the latter as in the case of ARISTO and the Winograd Challenge. MCTest has a train/test split, but the training set is likely too small to capture all the reasoning needed to do well on the test set. In our setup one can assess the amount of training examples needed to perform well (which can be increased as desired), and the commonsense knowledge and reasoning required for the test set should be contained in the training set. In terms of diversity, some of our tasks are related to existing setups but we also propose many additional ones; tasks 3.8 and 3.9 are inspired by previous work on lambda dependency-based compositional semantics (Liang et al., 2013; Liang, 2013), for instance. For us, each task checks one skill that the system must have, and we postulate that performing well on all of them is a prerequisite for any system aiming at full text understanding and reasoning.

3. The Tasks

Our main idea is to provide a set of tasks, in a similar way to how software testing is built in computer science. Ideally each task is a "leaf" test case, as independent from others as possible, and tests in the simplest way possible one aspect of intended behavior. Subsequent ("non-leaf") tests can build on these by testing combinations as well. The tasks are publicly available at http://fb.ai/babi.

Each task provides a set of training and test data, with the intention that a successful model performs well on test data. Following (Weston et al., 2014), the supervision in the training set is given by the true answers to questions, and the set of relevant statements for answering a given question, which may or may not be used by the learner. We set up the tasks so that correct answers are limited to a single word (Q: Where is Mark? A: bathroom), or else a list of words (Q: What is Mark holding?), as evaluation is then clear-cut, and is measured simply as right or wrong.
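To make the evaluation protocol concrete, a minimal exact-match scorer might look as follows (our sketch, not the released evaluation code; treating list answers as unordered comma-separated sets is an assumption):

    def is_correct(prediction, gold):
        """An answer is a single word or a comma-separated list of words;
        a prediction is simply right or wrong (exact match)."""
        normalize = lambda s: {w.strip().lower() for w in s.split(",")}
        return normalize(prediction) == normalize(gold)

    def accuracy(predictions, golds):
        """Fraction of questions answered exactly right."""
        return sum(map(is_correct, predictions, golds)) / len(golds)

    # Example: accuracy(["bathroom", "football, milk"],
    #                   ["bathroom", "milk, football"]) == 1.0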
The tasks are noiseless, so a careful adult reader should be able to achieve 100% accuracy. We tried to choose tasks that are natural to a human reader, and no background in areas such as formal semantics, machine learning, logic or knowledge representation is required for an adult to solve them.

The data itself is produced using a simple simulation of characters and objects moving around and interacting in locations, described in Section 4. The simulation allows us to generate data in many different scenarios where the true labels are known by grounding to the simulation. For each task, we describe it by giving a small sample of the dataset including statements, questions and the true labels (shown after "A:").

3.1. Basic Factoid QA with Single Supporting Fact

Our first task consists of questions where a single supporting fact that has been previously given provides the answer. We first test one of the simplest cases of this, by asking for the location of a person. A small sample of the task is thus:

John is in the playground.
Bob is in the office.
Where is John? A:playground

This kind of synthetic data was already used in (Weston et al., 2014). It can be considered the simplest case of some real world QA datasets such as in (Fader et al., 2013).

3.2. Factoid QA with Two Supporting Facts

A harder task is to answer questions where two supporting statements have to be chained to answer the question:

John is in the playground.
Bob is in the office.
John picked up the football.
Bob went to the kitchen.
Where is the football? A:playground

For example, to answer the question Where is the football?, both John picked up the football and John is in the playground are supporting facts. Again, this kind of task was already used in (Weston et al., 2014).

Note that, to show the difficulty of these tasks for a learning machine with no other knowledge, we can shuffle the letters of the alphabet and produce equivalent datasets:

Sbdm ip im vdu yonrckblms.
Abf ip im vdu bhhigu.
Sbdm yigaus ly vdu hbbvfnoo.
Abf zumv vb vdu aivgdum.
Mduku ip vdu hbbvfnoo? A:yonrckblms
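Concretely, such a letter-shuffled dataset can be produced by applying one fixed random permutation of the alphabet to every statement and question (a sketch of the transformation, not the released tool):

    import random
    import string

    def shuffle_alphabet(text, seed=0):
        """Apply a fixed random permutation of a-z to every letter,
        leaving spaces, punctuation and capitalisation positions untouched."""
        rng = random.Random(seed)
        shuffled = list(string.ascii_lowercase)
        rng.shuffle(shuffled)
        table = str.maketrans(
            string.ascii_lowercase + string.ascii_uppercase,
            "".join(shuffled) + "".join(shuffled).upper(),
        )
        return text.translate(table)

    print(shuffle_alphabet("John is in the playground."))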
We can also use the simulation to generate languages other than English. We thus produced the same set of tasks in Hindi, e.g. for this task:

Manoj gendh le kar aaya.
Manoj gusalkhaney mein chala gaya.
Gendh is samay kahan hai? A: gusalkhana
Manoj daftar gaya.
Priya bagichey gayi.
Gendh ab kahan hai? A: daftar

3.3. Factoid QA with Three Supporting Facts

Similarly, one can make a task with three supporting facts:

John picked up the apple.
John went to the office.
John went to the kitchen.
John dropped the apple.
Where was the apple before the kitchen? A:office

The first three statements are all required to answer this.

3.4. Two Argument Relations: Subject vs. Object

To answer questions, the ability to differentiate and recognize subjects and objects is crucial. We consider here the extreme case where sentences feature re-ordered words, i.e. a bag-of-words will not work:

The office is north of the bedroom.
The bedroom is north of the bathroom.
What is north of the bedroom? A: office
What is the bedroom north of? A: bathroom

Note that the two questions above have exactly the same words, but in a different order, and different answers.

3.5. Three Argument Relations

Similarly, sometimes one needs to differentiate three separate arguments, such as in the following task:

Mary gave the cake to Fred.
Fred gave the cake to Bill.
Jeff was given the milk by Bill.
Who gave the cake to Fred? A: Mary
Who did Fred give the cake to? A: Bill
What did Jeff receive? A: milk
Who gave the milk? A: Bill

The last question is potentially the hardest for a learner, as the first two can be answered by providing the actor that is not mentioned in the question.

3.6. Yes/No Questions

This task tests, on some of the simplest questions possible (specifically, ones with a single supporting fact), the ability of a model to answer true/false type questions:

John is in the playground.
Daniel picks up the milk.
Is John in the classroom? A:no
Does Daniel have the milk? A:yes
3.7. Counting

This task tests the ability of the QA system to perform simple counting operations, by asking about the number of objects with a certain property:

Daniel picked up the football.
Daniel dropped the football.
Daniel got the milk.
Daniel took the apple.
How many objects is Daniel holding? A: two

3.8. Lists / Sets

While many of our tasks are designed to have single word answers for simplicity, this set of tasks tests the ability to produce a set of single word answers in the form of a list, by asking about sets of entities with certain properties, e.g.:

Daniel picks up the football.
Daniel drops the newspaper.
Daniel picks up the milk.
What is Daniel holding? A: milk, football

The task above can be seen as a QA task related to a database search operation. Note that we could also consider the following question types: intersection (Who is in the park carrying food?), union (Who has milk or cookies?) and set difference (Who is in the park apart from Bill?). However, we leave those for future work.

3.9. Simple Negation

We test one of the simplest types of negation, that of supporting facts that imply a statement is false:

Sandra travelled to the office.
Fred is no longer in the office.
Is Fred in the office? A:no
Is Sandra in the office? A:yes

Task 3.6 (yes/no questions) is a prerequisite to this task.

3.10. Indefinite Knowledge

This task tests if we can model statements that describe possibilities rather than certainties:

John is either in the classroom or the playground.
Sandra is in the garden.
Is John in the classroom? A:maybe
Is John in the office? A:no

3.11. Basic Coreference

This task tests the simplest type of coreference, that of detecting the nearest referent, for example:

Daniel was in the kitchen.
Then he went to the studio.
Sandra was in the office.
Where is Daniel? A:studio

Real-world data typically addresses this as a labeling problem and studies more sophisticated phenomena (Haghighi & Klein, 2009). ARISTO also addresses this task.

3.12. Conjunction

This task tests referring to multiple subjects in a single statement, for example:

Mary and Jeff went to the kitchen.
Then Jeff went to the park.
Where is Mary? A: kitchen

3.13. Compound Coreference

This task tests coreference in the case where the pronoun can refer to multiple actors, for example:

Daniel and Sandra journeyed to the office.
Then they went to the garden.
Sandra and John travelled to the kitchen.
After that they moved to the hallway.
Where is Daniel? A: garden

3.14. Time Manipulation

While our tasks so far have included time implicitly in the order of the statements, this task tests understanding the use of time expressions within the statements, for example:

In the afternoon Julie went to the park.
Yesterday Julie was at school.
Julie went to the cinema this evening.
Where did Julie go after the park? A:cinema

Real-world datasets typically address the evaluation of time expressions as a labeling, rather than a QA, task, see e.g. (UzZaman et al., 2012).

3.15. Basic Deduction

This task tests basic deduction via inheritance of properties:

Sheep are afraid of wolves.
Cats are afraid of dogs.
Mice are afraid of cats.
Gertrude is a sheep.
What is Gertrude afraid of? A:wolves
3.18. Reasoning about Size

This task requires reasoning about the relative size of objects and is inspired by the commonsense reasoning examples in the Winograd schema challenge (Levesque et al., 2011):

The football fits in the suitcase.
The suitcase fits in the cupboard.
The box of chocolates is smaller than the football.
Will the box of chocolates fit in the suitcase? A:yes

Tasks 3.3 (three supporting facts) and 3.6 (yes/no questions) are prerequisites to this task.

3.19. Path Finding

In this task the goal is to find the path between locations:

The kitchen is north of the hallway.
The den is east of the hallway.
How do you go from den to kitchen? A: west, north

This is related to the work of (Chen & Mooney, 2011).

3.20. Reasoning about Agent's Motivations

This task tries to ask why an agent performs a certain action. It addresses the case of actors being in a given state (hungry, thirsty, tired, ...) and the actions they then take:

John is hungry.
John goes to the kitchen.
John eats the apple.
Daniel is hungry.
Where does Daniel go? A:kitchen
Why did John go to the kitchen? A:hungry

4. Simulation

The actions an actor can execute in the simulation consist of the following: go <location>, get <object>, get <object1> from <object2>, put <object1> in/on <object2>, give <object> to <actor>, drop <object>, set <entity> <state>, look, inventory and examine <object>. A set of universal constraints is imposed on those actions to enforce coherence in the simulation. For example, an actor cannot get something that they or someone else already has, they cannot go to a place that is not connected to the current location, they cannot drop something they do not already have, and so on.

Using the underlying actions, rules for actors, and their constraints defines how actors act. For each task we limit the actions needed for that task, e.g. task 3.1 only needs go whereas task 3.2 uses go, get and drop. If we write the commands down, this gives us a very simple "story" which is executable by the simulation, e.g., joe go playground; bob go office; joe get football. This example corresponds to task 3.2. The system can then ask questions about the state of the simulation, e.g., where john?, where football?, and so on. It is easy to calculate the true answers for these questions as we have access to the underlying world.

In order to produce more natural looking text with lexical variety from statements and questions, we employ a simple automated grammar. Each verb is assigned a set of synonyms, e.g., the simulation command get is replaced with either picked up, got, grabbed or took, and drop is replaced with either dropped, left, discarded or put down. Similarly, each object and actor can have a set of replacement synonyms as well, e.g. replacing Daniel with he in task 3.11. Adverbs are crucial for some tasks, such as the time manipulation task 3.14.
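To make the generation pipeline concrete, here is a minimal sketch of the idea (our illustration, not the released generator; the real simulation has many more actions, constraints and grammar rules):

    import random

    SYNONYMS = {"get": ["picked up", "got", "grabbed", "took"],
                "drop": ["dropped", "left", "discarded", "put down"]}

    class World:
        """A tiny slice of the simulation: actor locations, object holders,
        and a couple of the coherence constraints described above."""
        def __init__(self):
            self.location = {}          # actor -> location
            self.holder = {}            # object -> actor currently holding it

        def execute(self, actor, action, arg):
            if action == "go":
                self.location[actor] = arg
            elif action == "get":
                assert arg not in self.holder, "cannot get an object someone already has"
                self.holder[arg] = actor
            elif action == "drop":
                assert self.holder.get(arg) == actor, "cannot drop something not held"
                del self.holder[arg]

        def where(self, obj):
            """True answer for 'where <object>?': the location of its holder."""
            return self.location.get(self.holder.get(obj))

    def render(actor, action, arg, rng):
        """Turn a simulation command into text via the synonym grammar."""
        if action == "go":
            return f"{actor.capitalize()} went to the {arg}."
        return f"{actor.capitalize()} {rng.choice(SYNONYMS[action])} the {arg}."

    rng, world, story = random.Random(0), World(), []
    for cmd in [("joe", "go", "playground"), ("bob", "go", "office"),
                ("joe", "get", "football")]:
        world.execute(*cmd)
        story.append(render(*cmd, rng))
    print("\n".join(story) + "\nWhere is the football? A: " + world.where("football"))

Because the question is answered directly from the world state, the generated label is correct by construction, which is what makes the data noise-free.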
There are a great many aspects of language not yet modeled. For example, all sentences are so far relatively short and contain little nesting. Further, the number of entities and the vocabulary size are small (150 words, and typically 4 actors, 6 locations and 3 objects used per task). The hope is that defining a set of well defined tasks will help evaluate models in a controlled way within the simulated environment, which is hard to do with real data. These tasks are not a substitute for real data, but should complement them, especially when developing and analysing algorithms. Our aim is to make this simulation more sophisticated and to release improved versions and tasks over time. Hopefully it can then scale up to evaluate more and more useful properties.

5. Memory Networks

Memory Networks (Weston et al., 2014) are a recently proposed class of models that have shown promise on QA. They combine a memory m (an array of stored statements m_i) with four components I, G, O and R: I maps the incoming input to an internal feature representation, G updates the memory given the new input, O produces output features by reading from the memory, and R converts those output features into the final response.

Potentially, component I can make use of standard preprocessing, e.g., parsing and entity resolution, but the simplest form is to do no processing at all. The simplest form of G is to store the new incoming example in an empty memory slot, and leave the rest of the memory untouched. Thus, in (Weston et al., 2014) the actual implementation used is exactly this simple form, where the bulk of the work is in the O and R components. The former is responsible for reading from memory and performing inference, e.g., calculating what are the relevant memories to answer a question, and the latter for producing the actual wording of the answer given O.

The O module produces output features by finding k supporting memories given x. They use k = 2. For k = 1 the highest scoring supporting memory is retrieved with:

    o_1 = O_1(x, m) = argmax_{i=1,...,N} s_O(x, m_i)    (1)

where s_O is a function that scores the match between the pair of sentences x and m_i. For the case k = 2 they then find a second supporting memory given the first found in the previous iteration:

    o_2 = O_2(x, m) = argmax_{i=1,...,N} s_O([x, m_{o_1}], m_i)    (2)

where the candidate supporting memory m_i is now scored with respect to both the original input and the first supporting memory, and square brackets denote a list. The final output o is [x, m_{o_1}, m_{o_2}], which is input to the module R.

Finally, R needs to produce a textual response r. While the authors also consider Recurrent Neural Networks (RNNs), their standard setup limits responses to be a single word (out of all the words seen by the model) by ranking them:

    r = argmax_{w ∈ W} s_R([x, m_{o_1}, m_{o_2}], w)    (3)

where W is the set of all words in the dictionary and s_R is a function that scores the match between a list of sentences and a word.

They consider various extensions of their model, in particular modeling write time and modeling unseen words. Here we only discuss the former, which we also use. In order for the model to work on QA tasks over stories it needs to know in which order the sentences were uttered, which is not available in the model directly. They thus add extra write-time features to s_O which take on the value 0 or 1 indicating which sentence is older than another being compared, and compare triples of pairs of sentences and the question itself. Training is carried out by stochastic gradient descent using supervision from both the question answer pairs and the supporting memories (to select o_1 and o_2). See (Weston et al., 2014) for more details.
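As an illustration of the inference in Eqs. (1)-(3), the following sketch re-implements the greedy two-hop retrieval and single-word response ranking. The scorers s_O and s_R are assumed to be supplied; in the actual model they are learned embedding-based match functions, with the write-time features above added to s_O:

    def memnn_infer(x, memories, dictionary, s_O, s_R, k=2):
        """Greedy k-hop supporting-memory retrieval (Eqs. 1-2) followed by
        single-word response ranking over the dictionary (Eq. 3).
        s_O(query_list, memory) and s_R(query_list, word) are assumed to be
        already-trained match-scoring functions."""
        query = [x]                 # [x], then [x, m_o1], then [x, m_o1, m_o2]
        support = []
        for _ in range(k):
            best = max(memories, key=lambda m: s_O(query, m))
            support.append(best)
            query = [x] + support
        answer = max(dictionary, key=lambda w: s_R(query, w))
        return answer, support

As in Eq. (2), each hop rescans all memories and scores them against the growing list [x, m_o1, ...]; the retrieval is greedy, a property that the structured SVM baseline of Section 6.1 later contrasts with.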
5.1. Shortcomings of the Existing MemNNs

The Memory Network models defined in (Weston et al., 2014) are one possible technique to try on our tasks; however, there are several tasks which they are likely to fail on:

• They model sentences with a bag of words, so are likely to fail on tasks such as the 2-argument (Sec. 3.4) and 3-argument (Sec. 3.5) relation problems.
Table 1. Test accuracy (%) on our 20 tasks for various methods (training with 1000 training examples on each). Our proposed extensions to MemNNs are in columns 5-9: with adaptive memory (AM), N-grams (NG), nonlinear matching function (NL), multilinear matching (ML), and combinations thereof. Bold numbers indicate tasks where our extensions achieve ≥ 95% accuracy but the original MemNN model of (Weston et al., 2014) did not. The last two columns (10-11) give extra analysis of the AM + NG + NL MemNN method: column 10 gives the amount of training data for each task needed to obtain ≥ 95% accuracy, or FAIL if this is not achievable with 1000 training examples; the final column gives the accuracy when training on all data at once, rather than separately.
Columns, left to right: N-gram Classifier and LSTM (weakly supervised, answers only); MemNN (Weston et al., 2014), AM MemNN, AM+NG MemNN, AM+NL MemNN, AM+ML MemNN and AM+NG+NL MemNN (strong supervision, using supporting facts); number of training examples required by AM+NG+NL to reach ≥ 95% accuracy; AM+NG+NL accuracy under multitask training.

TASK | N-gram | LSTM | MemNN | AM | AM+NG | AM+NL | AM+ML | AM+NG+NL | Ex. required | Multitask
3.1 - Single Supporting Fact 36 50 100 100 100 100 100 100 250 ex. 100
3.2 - Two Supporting Facts 2 20 100 100 100 100 100 100 500 ex. 100
3.3 - Three Supporting Facts 7 20 20 100 99 100 99 100 500 ex. 98
3.4 - Two Arg. Relations 50 61 71 69 100 73 100 100 500 ex. 80
3.5 - Three Arg. Relations 20 70 83 83 86 86 98 98 1000 ex. 99
3.6 - Yes/No Questions 49 48 47 52 53 100 100 100 500 ex. 100
3.7 - Counting 52 49 68 78 86 83 90 85 FAIL 86
3.8 - Lists/Sets 40 45 77 90 88 94 91 91 FAIL 93
3.9 - Simple Negation 62 64 65 71 63 100 100 100 500 ex. 100
3.10 - Indefinite Knowledge 45 44 59 57 54 97 96 98 1000 ex. 98
3.11 - Basic Coreference 29 72 100 100 100 100 100 100 250 ex. 100
3.12 - Conjunction 9 74 100 100 100 100 100 100 250 ex. 100
3.13 - Compound Coreference 26 94 100 100 100 100 100 100 250 ex. 100
3.14 - Time Reasoning 19 27 99 100 99 100 99 99 500 ex. 99
3.15 - Basic Deduction 20 21 74 73 100 77 100 100 100 ex. 100
3.16 - Basic Induction 43 23 27 100 100 100 100 100 100 ex. 94
3.17 - Positional Reasoning 46 51 54 46 49 57 60 65 FAIL 72
3.18 - Size Reasoning 52 52 57 50 74 54 89 95 1000 ex. 93
3.19 - Path Finding 0 8 0 9 3 15 34 36 FAIL 19
3.20 - Agent’s Motivations 76 91 100 100 100 100 100 100 250 ex. 100
Mean Performance 34 49 75 79 83 87 93 93 100 92
A drawback of the bag-of-N-grams representation is that the dictionary grows rapidly with N. We therefore consider an alternative neural network approach, which we call a multilinear map. Each word in a sentence is binned into one of P_sz positions with p(i, l) = ⌈(i P_sz)/l⌉, where i is the position of the word in a sentence of length l, and for each position we employ an n × n matrix P_{p(i,l)}. We then model the matching score with:

    s(q, d) = E(q) · E(d);   E(x) = tanh( Σ_{i=1,...,l} P_{p(i,l)} Φ_x(x_i)^⊤ U )    (5)

whereby we apply a linear map for each word dependent on its position, followed by a tanh nonlinearity on the sum of mappings. Note that this is related to the model of (Yu et al., 2014), who consider tags rather than positions.

Finally, to assess the performance of nonlinear maps that do not model word position at all, we also consider the following nonlinear embedding:

    E(x) = tanh( W tanh( Φ_x(x)^⊤ U ) )    (6)

where W is an n × n matrix. This is similar to a classical two-layer neural network, but applied to both sides q and d of s(q, d). We also consider the straightforward combination of bag-of-N-grams followed by this nonlinearity.
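A sketch of the two embedding variants of Eqs. (5) and (6), with a one-hot word feature map Φ and randomly initialised parameters standing in for trained ones (the vocabulary size, embedding dimension and number of position bins below are illustrative):

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    V, n, P_sz = 150, 50, 4                      # vocabulary, embedding dim, position bins
    U = rng.normal(scale=0.1, size=(V, n))       # word embeddings: Phi_x(w)^T U selects a row
    P = rng.normal(scale=0.1, size=(P_sz, n, n)) # one n x n map per position bin (Eq. 5)
    W = rng.normal(scale=0.1, size=(n, n))       # second layer of the nonlinear map (Eq. 6)

    def bin_of(i, l):
        """p(i, l) = ceil(i * P_sz / l), shifted to 0-based indexing."""
        return math.ceil(i * P_sz / l) - 1

    def multilinear_embed(word_ids):
        """Eq. (5): position-dependent linear map per word, tanh of the sum."""
        l, total = len(word_ids), np.zeros(n)
        for i, w in enumerate(word_ids, start=1):
            total += P[bin_of(i, l)] @ U[w]
        return np.tanh(total)

    def nonlinear_embed(word_ids):
        """Eq. (6): bag-of-words embedding through a two-layer tanh map (no word order)."""
        bag = U[word_ids].sum(axis=0)
        return np.tanh(W @ np.tanh(bag))

    def score(q_ids, d_ids, embed):
        """s(q, d) = E(q) . E(d)."""
        return float(embed(q_ids) @ embed(d_ids))

    q, d = [3, 1, 4], [1, 5, 9, 2]               # toy word ids
    print(score(q, d, multilinear_embed), score(q, d, nonlinear_embed))

The matching score s(q, d) = E(q) · E(d) built from either embedding then plays the role of the plain bag-of-words match inside the Memory Network.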
6. Experiments

We compared the following methods on our set of tasks: (i) an N-gram baseline, (ii) LSTMs (long short-term memory recurrent neural networks) (Hochreiter & Schmidhuber, 1997), (iii) Memory Networks (MemNNs); and (iv) our extensions of Memory Networks described in Section 5.2. The N-gram baseline is inspired by the baselines in (Richardson et al., 2013) but applied to the case of producing a 1-word answer rather than answering a multiple choice question: we construct a bag-of-N-grams for all sentences in the story that share at least one word with the question, and then learn a linear classifier to predict the answer using those features³. LSTMs are a popular method for sequence prediction and outperform standard RNNs for similar tasks to ours in (Weston et al., 2014). Note that they are supervised by answers only, not supporting facts, and are hence at a disadvantage compared to MemNNs, which use them⁴.

³ Constructing N-grams from all sentences rather than using the filtered set gave worse results.
⁴ It is clearer to evaluate models in two tracks: fully and weakly supervised. Weak supervision is ultimately desirable; full supervision gives accuracy upper bounds for "weak" models.
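A sketch of that N-gram baseline under the stated construction (filter sentences sharing a word with the question, build a bag-of-N-grams, fit a linear classifier); the use of scikit-learn, trigrams and logistic regression as the linear classifier are our assumptions:

    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def words(s):
        return set(re.findall(r"[a-z]+", s.lower()))

    def story_features(story_sentences, question):
        """Keep only story sentences sharing at least one word with the
        question, then concatenate them with the question into one string."""
        kept = [s for s in story_sentences if words(s) & words(question)]
        return " ".join(kept + [question])

    # Bag-of-N-grams (here up to trigrams) feeding a linear classifier over
    # candidate single-word answers.
    ngram_baseline = make_pipeline(
        CountVectorizer(ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    # Training/testing (X is a list of story_features strings, y the answer words):
    # ngram_baseline.fit(X_train, y_train); ngram_baseline.predict(X_test)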
For each task we use 1000 questions for training, and 1000 for testing. Learning rates and other hyperparameters are chosen using the training set. For all MemNN variants we fixed the embedding dimension to n = 50 for simplicity; however, evaluation with larger n gave similar results.

The summary of our experimental results on the tasks is given in Table 1. We give results for each of the 20 tasks separately and the mean performance in the final row.

Standard MemNNs generally outperform the N-gram and LSTM baselines, which is consistent with the results in (Weston et al., 2014). However, they still "fail" at a number of tasks; that is, as the tasks have been built such that they are noise-free, we define failure to be test accuracy less than 95%⁵. Some of these failures are expected, as stated in Sec. 5.1: e.g., k = 2 facts, single word answers and bag-of-words do not succeed on tasks 3.3, 3.4, 3.5, 3.7, 3.8 and 3.18. However, there were also failures on tasks we did not at first expect, for example yes/no questions (3.6) and indefinite knowledge (3.10). With hindsight, we realize that the linear scoring function of standard MemNNs cannot model the match between query, supporting fact and a yes/no answer, as this requires three-way interactions.

⁵ The choice of 95% (and 1000 training examples) is arbitrary.

Columns 5-9 of Table 1 give the results for our MemNN extensions: adaptive memories and responses (AM) of Sec. 5.2.1, and the three sentence modeling approaches of Sec. 5.2.2: N-grams (NG), multilinear (ML) and nonlinear (NL), plus combinations thereof. The adaptive approach gives a straightforward improvement in tasks 3.3 and 3.16 because they both require more than two supporting facts, and also gives (small) improvements in 3.8 and 3.19 because they require multi-word outputs (but they still remain difficult). We hence use the AM model in combination with all our other extensions in the subsequent experiments.

MemNNs with N-gram modeling yield clear improvements when word order matters, e.g. tasks 3.4 and 3.15. However, N-grams do not seem to be a substitute for nonlinearities in the embedding function, as the NL model outperforms N-grams on average, especially on the yes/no (3.6) and indefinite knowledge (3.10) tasks, as explained before. On the other hand, the NL method cannot model word order and so fails e.g. on task 3.4. The obvious step is thus to combine these complementary approaches: indeed AM+NG+NL (column 9) gives improved results over both, with a total of 9 tasks that have been upgraded from failure to success compared to the original MemNN model. The multilinear model, as an alternative to this approach, also does similarly well and may be useful in real-world cases where N-grams cause the dictionary to be too large.

The final two columns (10-11) give further analysis of the AM+NG+NL MemNN method. The second to last column (10) shows the minimum number of training examples required to achieve ≥ 95% accuracy, or FAIL if this is not achieved with 1000 examples. This is important as it is not only desirable to perform well on a task, but also to do so using the
fewest number of examples (to generalize well, quickly). Most tasks require 100-500 examples. Task 3.8 requires 5000 examples and 3.7 requires 10000, hence they are labeled as FAIL. The latter task can presumably be solved by adding all the times an object is picked up and subtracting the times it is dropped, which seems possible for a MemNN, but it does not do this perfectly. Two tasks, positional reasoning 3.17 and path finding 3.19, cannot be solved even with 10000 examples; it seems those (and indeed more advanced forms of induction and deduction, which we plan to build) require a general search algorithm to be built into the inference procedure, which MemNNs lack.

The last column shows the performance of AM+NG+NL MemNNs when training on all the tasks jointly, rather than just on a single one. The performance is generally encouragingly similar, showing such a model can learn many aspects of text understanding and reasoning simultaneously.

6.1. Baseline using External Resources

We also built a classical cascade NLP system baseline using a structured SVM, which incorporates coreference resolution and semantic role labeling preprocessing steps, which are themselves trained on large amounts of costly labeled data. We first run the Stanford coreference system (Raghunathan et al., 2010) on the stories, and each mention is then replaced with the first mention of its entity class. Second, the SENNA semantic role labeling system (SRL) (Collobert et al., 2011) is run, and we collect the set of arguments for each verb. We then define a ranking task for finding the supporting facts (trained using strong supervision):

    o_1, o_2, o_3 = argmax_{o ∈ O} S_O(x, f_{o_1}, f_{o_2}, f_{o_3}; Θ)

where given the question x we find at most three supporting facts with indices o_i from the set of facts f in the story (we also consider selecting an "empty fact" for the case of fewer than three), and S_O is a linear scoring function with parameters Θ. Computing the argmax requires doing exhaustive search, unlike e.g. the MemNN method, which is greedy. For scalability, we thus prune the set of possible matches by requiring that facts share one common non-determiner word with each other match or with x. S_O is constructed as a set of indicator features. For simplicity each of the features only looks at pairs of sentences, i.e. S_O(x, f_{o_1}, f_{o_2}, f_{o_3}; Θ) = Θ · (g(x, f_{o_1}), g(x, f_{o_2}), g(x, f_{o_3}), g(f_{o_1}, f_{o_2}), g(f_{o_2}, f_{o_3}), g(f_{o_1}, f_{o_3})). The feature function g is made up of the following feature types, shown here for g(f_{o_1}, f_{o_2}): (1) Word pairs: one indicator variable for each pair of words in f_{o_1} and f_{o_2}. (2) Pair distance: an indicator for the distance between the sentences, i.e. o_1 − o_2. (3) Pair order: an indicator for the order of the sentences, i.e. o_1 > o_2. (4) SRL Verb Pair: indicator variables for each pair of SRL verbs in f_{o_1} and f_{o_2}. (5) SRL Verb-Arg Pair: indicator variables for each pair of SRL arguments in f_{o_1}, f_{o_2} and their corresponding verbs. After finding the supporting facts, we build a similar structured SVM for the response stage, also with features tuned for that goal: Words – an indicator for each word in x; Word Pairs – an indicator for each pair of words in x and the supporting facts; and similar SRL Verb and SRL Verb-Arg Pair features as before.
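The following sketch illustrates the pairwise feature function g and the pruned exhaustive argmax over triples of facts; it covers only the indicator features (1)-(3), and the SRL-based features, the "empty fact" option and the learned weights Θ are left out, with the pruning shown being a simplification of the one described above:

    from itertools import combinations

    def tokens(s):
        return s.lower().rstrip(".?").split()

    def g(a, b, ia=None, ib=None):
        """Pairwise indicator features: (1) word pairs and, when both items are
        facts with story indices, (2) their distance and (3) their order."""
        feats = {}
        for wa in tokens(a):
            for wb in tokens(b):
                feats[f"wordpair:{wa}|{wb}"] = 1.0
        if ia is not None and ib is not None:
            feats[f"distance:{ia - ib}"] = 1.0
            feats[f"order:{int(ia > ib)}"] = 1.0
        return feats

    def S_O(theta, x, facts, triple):
        """Linear score of a candidate triple: sum of all pairwise feature blocks."""
        feats = {}
        for i in triple:
            feats.update(g(x, facts[i]))
        for i, j in combinations(triple, 2):
            feats.update(g(facts[i], facts[j], i, j))
        return sum(theta.get(k, 0.0) for k in feats)

    def find_supporting_facts(theta, x, facts, k=3):
        """Exhaustive argmax over triples, crudely pruned to facts sharing a
        word with the question."""
        cand = [i for i, f in enumerate(facts) if set(tokens(x)) & set(tokens(f))]
        return max(combinations(cand, k),
                   key=lambda t: S_O(theta, x, facts, t), default=None)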
Results are given in Table 2. The structured SVM, despite having access to external resources, does not perform better than MemNNs overall, still failing at 9 tasks. It does perform well on tasks 3.6, 3.9 and 3.10, where the hand-built feature conjunctions capture the necessary nonlinearities that the original MemNNs do not. However, it seems to do significantly worse on tasks requiring three (and sometimes two) supporting facts (e.g. tasks 3.3, 3.16 and 3.2), presumably because ranking over so many possibilities introduces more mistakes. However, its non-greedy search does seem to help on other tasks, such as path finding (task 3.19), where search is very important.

7. Conclusion

We developed a set of tasks that we believe are a prerequisite to full language understanding and reasoning, and which include both training and testing data. While any learner that can solve these tasks is not necessarily close to solving AI, we believe that if a learner fails on any of our tasks, this exposes that it is definitely not going to solve AI.

We also presented some models that attempt to solve these tasks. Overall, our experiments give further proof that Memory Networks are an interesting model beyond the original paper. However, we also highlighted many flaws in that model, which our proposed extensions ameliorate to a degree. The main issues are that the models still fail on several of the tasks, and use a far stronger form of supervision (using supporting facts) than is typically realistic.

We hope that future research will aim to minimize the amount of required supervision, as well as the number of training examples that has to be seen to solve a new task. For example, it seems that humans are able to generalize to new tasks after seeing only a couple of dozen examples, without any additional supervision signal. Further, our hope is that a feedback loop of developing more challenging tasks, and then algorithms that can solve them, leads us in a fruitful research direction.
Table 2. Test accuracy (%) on our 20 Tasks for the baseline of Section 6.1 that uses external resources, comparing to various methods
from Table 1.
Columns, left to right: N-gram Classifier and LSTM (weakly supervised); MemNN (Weston et al., 2014) and AM+NG+NL MemNN (strong supervision, using supporting facts); Structured SVM with coreference and SRL features (strong supervision, uses external resources).

TASK | N-gram | LSTM | MemNN | AM+NG+NL MemNN | Structured SVM
3.1 - Single Supporting Fact 36 50 100 100 99
3.2 - Two Supporting Facts 2 20 100 100 74
3.3 - Three Supporting Facts 7 20 20 100 17
3.4 - Two Arg. Relations 50 61 71 100 98
3.5 - Three Arg. Relations 20 70 83 98 83
3.6 - Yes/No Questions 49 48 47 100 99
3.7 - Counting 52 49 68 85 69
3.8 - Lists/Sets 40 45 77 91 70
3.9 - Simple Negation 62 64 65 100 100
3.10 - Indefinite Knowledge 45 44 59 98 99
3.11 - Basic Coreference 29 72 100 100 100
3.12 - Conjunction 9 74 100 100 96
3.13 - Compound Coreference 26 94 100 100 99
3.14 - Time Reasoning 19 27 99 99 99
3.15 - Basic Deduction 20 21 74 100 96
3.16 - Basic Induction 43 23 27 100 24
3.17 - Positional Reasoning 46 51 54 65 61
3.18 - Size Reasoning 52 52 57 95 62
3.19 - Path Finding 0 8 0 36 49
3.20 - Agent’s Motivations 76 91 100 100 95
Mean Performance 34 49 75 93 79
References

Bache, K. and Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Berant, Jonathan, Chou, Andrew, Frostig, Roy, and Liang, Percy. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pp. 1533–1544, 2013.

Berant, Jonathan, Srikumar, Vivek, Chen, Pei-Chun, Huang, Brad, Manning, Christopher D., Vander Linden, Abby, Harding, Brittany, and Clark, Peter. Modeling biological processes for reading comprehension. In Proc. EMNLP, 2014.

Bordes, Antoine, Usunier, Nicolas, Collobert, Ronan, and Weston, Jason. Towards understanding situated natural language. In AISTATS, 2010.

Chen, David L. and Mooney, Raymond J. Learning to interpret natural language navigation instructions from observations. In Proc. AAAI, San Francisco, CA, pp. 859–865, 2011.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.

Fader, Anthony, Zettlemoyer, Luke, and Etzioni, Oren. Paraphrase-driven learning for open question answering. In ACL, pp. 1608–1618, 2013.

Fader, Anthony, Zettlemoyer, Luke, and Etzioni, Oren. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1156–1165. ACM, 2014.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Haghighi, Aria and Klein, Dan. Simple coreference resolution with rich syntactic and semantic features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 1152–1161. Association for Computational Linguistics, 2009.

Halevy, Alon, Norvig, Peter, and Pereira, Fernando. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Levesque, Hector J., Davis, Ernest, and Morgenstern, Leora. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011.

Liang, Percy. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408, 2013.

Liang, Percy, Jordan, Michael I., and Klein, Dan. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446, 2013.

Minsky, Marvin and Papert, Seymour. Perceptrons: An Introduction to Computational Geometry. The MIT Press, Cambridge, 1969 (expanded edition, 1988).

Yao, Xuchen, Berant, Jonathan, and Van Durme, Benjamin. Freebase QA: Information extraction or semantic parsing? ACL 2014, pp. 82, 2014.

Yu, Mo, Gormley, Matthew R., and Dredze, Mark. Factor-based compositional embedding models. NIPS 2014 Workshop on Learning Semantics, 2014.

Zhu, Xiaojin, Ghahramani, Zoubin, Lafferty, John, et al. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pp. 912–919, 2003.