Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
There is a rich history of the use of synthetic tasks in machine learning, from the XOR problem which helped motivate neural networks (Minsky & Papert, 1969; Rumelhart et al., 1985), to circle, spiral and ring datasets that helped motivate some of the most well-known clustering and semi-supervised learning algorithms (Ng et al., 2002; Zhu et al., 2003), Mackey-Glass equations for time series (Müller et al., 1997), and so on – in fact some of the well known UCI datasets (Bache & Lichman, 2013) are synthetic as well (e.g., waveform). Recent work continues this trend. For example, in the area of developing learning algorithms with a memory component, synthetic datasets were used to help develop both the Neural Turing Machine of Graves et al. (2014) and the Memory Networks of Weston et al. (2014).

Our tasks are built with a unified underlying simulation of a physical world, akin to a classic text adventure game (Montfort, 2005), whereby actors move around manipulating objects and interacting with each other. As the simulation runs, grounded text and question answer pairs are simultaneously generated. Our goal is to categorize different kinds of questions into skill sets, which become our tasks. Our hope is that the analysis of performance on these tasks will help expose weaknesses of current models and help motivate new algorithm designs that alleviate these weaknesses. We further envision this as a feedback loop where new tasks can then be designed in response, perhaps in an adversarial fashion, in order to break the new models.
The tasks we design are detailed in Section 3, and the simulation used to generate them in Section 4. In Section 6 we give benchmark results of standard methods on our tasks, and analyse their successes and failures. In order to exemplify the kind of feedback loop between algorithm development and task development we envision, in Section 5 we propose a set of improvements to the recent Memory Network method, which has been shown to give promising performance in QA. We show that our proposed approach does indeed give improved performance on some tasks, but is still unable to solve some of them, which we consider as open problems.

2. Related Work

Several projects targeting language understanding using QA-based strategies have recently emerged. Unlike tasks like dialogue or summarization, QA is easy to evaluate (especially in true/false or multiple choice scenarios) and is hence an appealing research avenue. The difficulty lies in the definition of questions: they must be unambiguously answerable by adult humans (or children), but still require some thinking. The Allen Institute for AI's flagship project ARISTO¹ is organized around a collection of QA tasks derived from increasingly difficult science exams, at the 4th, 8th, and 12th grade levels. Richardson et al. (2013) proposed MCTest², a set of 660 stories and associated questions intended for research on the machine comprehension of text. Each question requires the reader to understand different aspects of the story.

¹ http://allenai.org/aristo.html
² http://research.microsoft.com/mct

These two initiatives go in a promising direction, but interpreting the results on these benchmarks remains complicated. Indeed, no system has yet been able to fully solve the proposed tasks, and since many sub-tasks need to be solved to answer any of their questions (coreference, deduction, use of common-sense, etc.), it is difficult to clearly identify the capabilities and limitations of these systems and hence to propose improvements and modifications. As a result, conclusions drawn from these projects are not much clearer than those coming from more traditional work on QA over large-scale Knowledge Bases (Berant et al., 2013; Fader et al., 2014). Besides, the best performing systems are based on hand-crafted patterns and features, and/or statistics acquired on very large corpora. It is difficult to argue that such systems actually understand language and are not simply light upgrades of traditional information extraction methods (Yao et al., 2014). The system of Berant et al. (2014) is more evolved, since it builds a structured representation of a text and of a question to answer. Despite its potential, this method remains highly domain specific and relies on a lot of prior knowledge.

Based on these observations, we chose to conceive a collection of much simpler QA tasks, with the main objective that failure or success of a system on any of them can unequivocally provide feedback on its capabilities. In that, we are close to the Winograd Schema Challenge (Levesque et al., 2011), which is organized around simple statements followed by a single binary choice question such as: "Joan made sure to thank Susan for all the help she had received. Who had received the help? Joan or Susan?". In this challenge, as in our tasks, it is straightforward to interpret results. Yet, where the Winograd Challenge is mostly centered around evaluating whether systems can acquire and make use of background knowledge that is not expressed in the words of the statement, our tasks are self-contained and more diverse. By self-contained we mean that our tasks come with both training data and evaluation data, rather than just the latter as in the case of ARISTO and the Winograd Challenge. MCTest has a train/test split, but the training set is likely too small to capture all the reasoning needed to do well on the test set. In our setup one can assess the amount of training examples needed to perform well (which can be increased as desired), and the commonsense knowledge and reasoning required for the test set should be contained in the training set. In terms of diversity, some of our tasks are related to existing setups but we also propose many additional ones; tasks 3.8 and 3.9 are inspired by previous work on lambda dependency-based compositional semantics (Liang et al., 2013; Liang, 2013), for instance. For us, each task checks one skill that the system must have, and we postulate that performing well on all of them is a prerequisite for any system aiming at full text understanding and reasoning.

3. The Tasks

Our main idea is to provide a set of tasks, in a similar way to how software testing is built in computer science. Ideally each task is a "leaf" test case, as independent from others as possible, and tests in the simplest way possible one aspect of intended behavior. Subsequent ("non-leaf") tests can build on these by testing combinations as well. The tasks are publicly available at http://fb.ai/babi.

Each task provides a set of training and test data, with the intention that a successful model performs well on test data. Following (Weston et al., 2014), the supervision in the training set is given by the true answers to questions, and the set of relevant statements for answering a given question, which may or may not be used by the learner. We set up the tasks so that correct answers are limited to a single word (Q: Where is Mark? A: bathroom), or else a list of words (Q: What is Mark holding?), as evaluation is then clear-cut, and is measured simply as right or wrong.
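To make the evaluation protocol concrete, a minimal exact-match scorer might look as follows (our sketch, not the released evaluation code; treating list answers as unordered comma-separated sets is an assumption):

    def is_correct(prediction, gold):
        """An answer is a single word or a comma-separated list of words;
        a prediction is simply right or wrong (exact match)."""
        normalize = lambda s: {w.strip().lower() for w in s.split(",")}
        return normalize(prediction) == normalize(gold)

    def accuracy(predictions, golds):
        """Fraction of questions answered exactly right."""
        return sum(map(is_correct, predictions, golds)) / len(golds)

    # Example: accuracy(["bathroom", "football, milk"],
    #                   ["bathroom", "milk, football"]) == 1.0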
The tasks are noiseless, so a careful adult reader should be able to achieve 100% accuracy. We tried to choose tasks that are natural to a human reader, and no background in areas such as formal semantics, machine learning, logic or knowledge representation is required for an adult to solve them.

The data itself is produced using a simple simulation of characters and objects moving around and interacting in locations, described in Section 4. The simulation allows us to generate data in many different scenarios where the true labels are known by grounding to the simulation. For each task, we describe it by giving a small sample of the dataset including statements, questions and the true labels (shown after "A:").

3.1. Basic Factoid QA with Single Supporting Fact

Our first task consists of questions where a single supporting fact that has been previously given provides the answer. We first test one of the simplest cases of this, by asking for the location of a person. A small sample of the task is thus:

John is in the playground.
Bob is in the office.
Where is John? A:playground

This kind of synthetic data was already used in (Weston et al., 2014). It can be considered the simplest case of some real world QA datasets such as in (Fader et al., 2013).

3.2. Factoid QA with Two Supporting Facts

A harder task is to answer questions where two supporting statements have to be chained to answer the question:

John is in the playground.
Bob is in the office.
John picked up the football.
Bob went to the kitchen.
Where is the football? A:playground

For example, to answer the question Where is the football?, both John picked up the football and John is in the playground are supporting facts. Again, this kind of task was already used in (Weston et al., 2014).

Note that, to show the difficulty of these tasks for a learning machine with no other knowledge, we can shuffle the letters of the alphabet and produce equivalent datasets:

Sbdm ip im vdu yonrckblms.
Abf ip im vdu bhhigu.
Sbdm yigaus ly vdu hbbvfnoo.
Abf zumv vb vdu aivgdum.
Mduku ip vdu hbbvfnoo? A:yonrckblms
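Concretely, such a letter-shuffled dataset can be produced by applying one fixed random permutation of the alphabet to every statement and question (a sketch of the transformation, not the released tool):

    import random
    import string

    def shuffle_alphabet(text, seed=0):
        """Apply a fixed random permutation of a-z to every letter,
        leaving spaces, punctuation and capitalisation positions untouched."""
        rng = random.Random(seed)
        shuffled = list(string.ascii_lowercase)
        rng.shuffle(shuffled)
        table = str.maketrans(
            string.ascii_lowercase + string.ascii_uppercase,
            "".join(shuffled) + "".join(shuffled).upper(),
        )
        return text.translate(table)

    print(shuffle_alphabet("John is in the playground."))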
We can also use the simulation to generate languages other than English. We thus produced the same set of tasks in Hindi, e.g. for this task:

Manoj gendh le kar aaya.
Manoj gusalkhaney mein chala gaya.
Gendh is samay kahan hai? A: gusalkhana
Manoj daftar gaya.
Priya bagichey gayi.
Gendh ab kahan hai? A: daftar

3.3. Factoid QA with Three Supporting Facts

Similarly, one can make a task with three supporting facts:

John picked up the apple.
John went to the office.
John went to the kitchen.
John dropped the apple.
Where was the apple before the kitchen? A:office

The first three statements are all required to answer this.

3.4. Two Argument Relations: Subject vs. Object

To answer questions, the ability to differentiate and recognize subjects and objects is crucial. We consider here the extreme case where sentences feature re-ordered words, i.e. a bag-of-words will not work:

The office is north of the bedroom.
The bedroom is north of the bathroom.
What is north of the bedroom? A: office
What is the bedroom north of? A: bathroom

Note that the two questions above have exactly the same words, but in a different order, and different answers.

3.5. Three Argument Relations

Similarly, sometimes one needs to differentiate three separate arguments, such as in the following task:

Mary gave the cake to Fred.
Fred gave the cake to Bill.
Jeff was given the milk by Bill.
Who gave the cake to Fred? A: Mary
Who did Fred give the cake to? A: Bill
What did Jeff receive? A: milk
Who gave the milk? A: Bill

The last question is potentially the hardest for a learner, as the first two can be answered by providing the actor that is not mentioned in the question.

3.6. Yes/No Questions

This task tests, on some of the simplest questions possible (specifically, ones with a single supporting fact), the ability of a model to answer true/false type questions:

John is in the playground.
Daniel picks up the milk.
Is John in the classroom? A:no
Does Daniel have the milk? A:yes
3.7. Counting

This task tests the ability of the QA system to perform simple counting operations, by asking about the number of objects with a certain property:

Daniel picked up the football.
Daniel dropped the football.
Daniel got the milk.
Daniel took the apple.
How many objects is Daniel holding? A: two

3.8. Lists / Sets

While many of our tasks are designed to have single word answers for simplicity, this set of tasks tests the ability to produce a set of single word answers in the form of a list, by asking about sets of entities with certain properties, e.g.:

Daniel picks up the football.
Daniel drops the newspaper.
Daniel picks up the milk.
What is Daniel holding? A: milk, football

The task above can be seen as a QA task related to a database search operation. Note that we could also consider the following question types: intersection (Who is in the park carrying food?), union (Who has milk or cookies?) and set difference (Who is in the park apart from Bill?). However, we leave those for future work.

3.9. Simple Negation

We test one of the simplest types of negation, that of supporting facts that imply a statement is false:

Sandra travelled to the office.
Fred is no longer in the office.
Is Fred in the office? A:no
Is Sandra in the office? A:yes

Task 3.6 (yes/no questions) is a prerequisite to this task.

3.10. Indefinite Knowledge

This task tests if we can model statements that describe possibilities rather than certainties:

John is either in the classroom or the playground.
Sandra is in the garden.
Is John in the classroom? A:maybe
Is John in the office? A:no

3.11. Basic Coreference

This task tests the simplest type of coreference, that of detecting the nearest referent, for example:

Daniel was in the kitchen.
Then he went to the studio.
Sandra was in the office.
Where is Daniel? A:studio

Real-world data typically addresses this as a labeling problem and studies more sophisticated phenomena (Haghighi & Klein, 2009). ARISTO also addresses this task.

3.12. Conjunction

This task tests referring to multiple subjects in a single statement, for example:

Mary and Jeff went to the kitchen.
Then Jeff went to the park.
Where is Mary? A: kitchen

3.13. Compound Coreference

This task tests coreference in the case where the pronoun can refer to multiple actors, for example:

Daniel and Sandra journeyed to the office.
Then they went to the garden.
Sandra and John travelled to the kitchen.
After that they moved to the hallway.
Where is Daniel? A: garden

3.14. Time Manipulation

While our tasks so far have included time implicitly in the order of the statements, this task tests understanding the use of time expressions within the statements, for example:

In the afternoon Julie went to the park.
Yesterday Julie was at school.
Julie went to the cinema this evening.
Where did Julie go after the park? A:cinema

Real-world datasets typically address the evaluation of time expressions as a labeling, rather than a QA, task, see e.g. (UzZaman et al., 2012).

3.15. Basic Deduction

This task tests basic deduction via inheritance of properties:

Sheep are afraid of wolves.
Cats are afraid of dogs.
Mice are afraid of cats.
Gertrude is a sheep.
What is Gertrude afraid of? A:wolves
3.18. Reasoning about Size

This task requires reasoning about the relative size of objects and is inspired by the commonsense reasoning examples in the Winograd schema challenge (Levesque et al., 2011):

The football fits in the suitcase.
The suitcase fits in the cupboard.
The box of chocolates is smaller than the football.
Will the box of chocolates fit in the suitcase? A:yes

Tasks 3.3 (three supporting facts) and 3.6 (yes/no questions) are prerequisites to this task.

3.19. Path Finding

In this task the goal is to find the path between locations:

The kitchen is north of the hallway.
The den is east of the hallway.
How do you go from den to kitchen? A: west, north

This is related to the work of (Chen & Mooney, 2011).

3.20. Reasoning about Agent's Motivations

This task tries to ask why an agent performs a certain action. It addresses the case of actors being in a given state (hungry, thirsty, tired, ...) and the actions they then take:

John is hungry.
John goes to the kitchen.
John eats the apple.
Daniel is hungry.
Where does Daniel go? A:kitchen
Why did John go to the kitchen? A:hungry

4. Simulation

The actions an actor can execute in the simulation consist of the following: go <location>, get <object>, get <object1> from <object2>, put <object1> in/on <object2>, give <object> to <actor>, drop <object>, set <entity> <state>, look, inventory and examine <object>. A set of universal constraints is imposed on those actions to enforce coherence in the simulation. For example, an actor cannot get something that they or someone else already has, they cannot go to a place that is not connected to the current location, they cannot drop something they do not already have, and so on.

Using the underlying actions, rules for actors, and their constraints defines how actors act. For each task we limit the actions needed for that task, e.g. task 3.1 only needs go whereas task 3.2 uses go, get and drop. If we write the commands down, this gives us a very simple "story" which is executable by the simulation, e.g., joe go playground; bob go office; joe get football. This example corresponds to task 3.2. The system can then ask questions about the state of the simulation, e.g., where john?, where football?, and so on. It is easy to calculate the true answers for these questions as we have access to the underlying world.

In order to produce more natural looking text with lexical variety from statements and questions, we employ a simple automated grammar. Each verb is assigned a set of synonyms, e.g., the simulation command get is replaced with either picked up, got, grabbed or took, and drop is replaced with either dropped, left, discarded or put down. Similarly, each object and actor can have a set of replacement synonyms as well, e.g. replacing Daniel with he in task 3.11. Adverbs are crucial for some tasks, such as the time manipulation task 3.14.
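To make the generation pipeline concrete, here is a minimal sketch of the idea (our illustration, not the released generator; the real simulation has many more actions, constraints and grammar rules):

    import random

    SYNONYMS = {"get": ["picked up", "got", "grabbed", "took"],
                "drop": ["dropped", "left", "discarded", "put down"]}

    class World:
        """A tiny slice of the simulation: actor locations, object holders,
        and a couple of the coherence constraints described above."""
        def __init__(self):
            self.location = {}          # actor -> location
            self.holder = {}            # object -> actor currently holding it

        def execute(self, actor, action, arg):
            if action == "go":
                self.location[actor] = arg
            elif action == "get":
                assert arg not in self.holder, "cannot get an object someone already has"
                self.holder[arg] = actor
            elif action == "drop":
                assert self.holder.get(arg) == actor, "cannot drop something not held"
                del self.holder[arg]

        def where(self, obj):
            """True answer for 'where <object>?': the location of its holder."""
            return self.location.get(self.holder.get(obj))

    def render(actor, action, arg, rng):
        """Turn a simulation command into text via the synonym grammar."""
        if action == "go":
            return f"{actor.capitalize()} went to the {arg}."
        return f"{actor.capitalize()} {rng.choice(SYNONYMS[action])} the {arg}."

    rng, world, story = random.Random(0), World(), []
    for cmd in [("joe", "go", "playground"), ("bob", "go", "office"),
                ("joe", "get", "football")]:
        world.execute(*cmd)
        story.append(render(*cmd, rng))
    print("\n".join(story) + "\nWhere is the football? A: " + world.where("football"))

Because the question is answered directly from the world state, the generated label is correct by construction, which is what makes the data noise-free.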
There are a great many aspects of language not yet modeled. For example, all sentences are so far relatively short and contain little nesting. Further, the number of entities and the vocabulary size are small (150 words, and typically 4 actors, 6 locations and 3 objects used per task). The hope is that defining a set of well defined tasks will help evaluate models in a controlled way within the simulated environment, which is hard to do with real data. These tasks are not a substitute for real data, but should complement them, especially when developing and analysing algorithms. Our aim is to make this simulation more sophisticated and to release improved versions and tasks over time. Hopefully it can then scale up to evaluate more and more useful properties.

5. Memory Networks

Memory Networks (Weston et al., 2014) are a recently proposed class of models that have shown promise on QA. They combine a memory m (an array of stored statements m_i) with four components I, G, O and R: I maps the incoming input to an internal feature representation, G updates the memory given the new input, O produces output features by reading from the memory, and R converts those output features into the final response.

Potentially, component I can make use of standard preprocessing, e.g., parsing and entity resolution, but the simplest form is to do no processing at all. The simplest form of G is to store the new incoming example in an empty memory slot, and leave the rest of the memory untouched. Thus, in (Weston et al., 2014) the actual implementation used is exactly this simple form, where the bulk of the work is in the O and R components. The former is responsible for reading from memory and performing inference, e.g., calculating what are the relevant memories to answer a question, and the latter for producing the actual wording of the answer given O.

The O module produces output features by finding k supporting memories given x. They use k = 2. For k = 1 the highest scoring supporting memory is retrieved with:

    o_1 = O_1(x, m) = argmax_{i=1,...,N} s_O(x, m_i)    (1)

where s_O is a function that scores the match between the pair of sentences x and m_i. For the case k = 2 they then find a second supporting memory given the first found in the previous iteration:

    o_2 = O_2(x, m) = argmax_{i=1,...,N} s_O([x, m_{o_1}], m_i)    (2)

where the candidate supporting memory m_i is now scored with respect to both the original input and the first supporting memory, and square brackets denote a list. The final output o is [x, m_{o_1}, m_{o_2}], which is input to the module R.

Finally, R needs to produce a textual response r. While the authors also consider Recurrent Neural Networks (RNNs), their standard setup limits responses to be a single word (out of all the words seen by the model) by ranking them:

    r = argmax_{w ∈ W} s_R([x, m_{o_1}, m_{o_2}], w)    (3)

where W is the set of all words in the dictionary and s_R is a function that scores the match between a list of sentences and a word.

They consider various extensions of their model, in particular modeling write time and modeling unseen words. Here we only discuss the former, which we also use. In order for the model to work on QA tasks over stories it needs to know in which order the sentences were uttered, which is not available in the model directly. They thus add extra write-time features to s_O which take on the value 0 or 1 indicating which sentence is older than another being compared, and compare triples of pairs of sentences and the question itself. Training is carried out by stochastic gradient descent using supervision from both the question answer pairs and the supporting memories (to select o_1 and o_2). See (Weston et al., 2014) for more details.
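As an illustration of the inference in Eqs. (1)-(3), the following sketch re-implements the greedy two-hop retrieval and single-word response ranking. The scorers s_O and s_R are assumed to be supplied; in the actual model they are learned embedding-based match functions, with the write-time features above added to s_O:

    def memnn_infer(x, memories, dictionary, s_O, s_R, k=2):
        """Greedy k-hop supporting-memory retrieval (Eqs. 1-2) followed by
        single-word response ranking over the dictionary (Eq. 3).
        s_O(query_list, memory) and s_R(query_list, word) are assumed to be
        already-trained match-scoring functions."""
        query = [x]                 # [x], then [x, m_o1], then [x, m_o1, m_o2]
        support = []
        for _ in range(k):
            best = max(memories, key=lambda m: s_O(query, m))
            support.append(best)
            query = [x] + support
        answer = max(dictionary, key=lambda w: s_R(query, w))
        return answer, support

As in Eq. (2), each hop rescans all memories and scores them against the growing list [x, m_o1, ...]; the retrieval is greedy, a property that the structured SVM baseline of Section 6.1 later contrasts with.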
5.1. Shortcomings of the Existing MemNNs

The Memory Network models defined in (Weston et al., 2014) are one possible technique to try on our tasks; however, there are several tasks which they are likely to fail on:

• They model sentences with a bag of words, so are likely to fail on tasks such as the 2-argument (Sec. 3.4) and 3-argument (Sec. 3.5) relation problems.
Table 1. Test accuracy (%) on our 20 tasks for various methods (training with 1000 training examples on each). Our proposed extensions to MemNNs are in columns 5-9: with adaptive memory (AM), N-grams (NG), nonlinear matching function (NL), multilinear matching (ML), and combinations thereof. Bold numbers indicate tasks where our extensions achieve ≥ 95% accuracy but the original MemNN model of (Weston et al., 2014) did not. The last two columns (10-11) give extra analysis of the AM + NG + NL MemNN method: column 10 gives the amount of training data for each task needed to obtain ≥ 95% accuracy, or FAIL if this is not achievable with 1000 training examples; the final column gives the accuracy when training on all data at once, rather than separately.
Columns, left to right: N-gram Classifier and LSTM (weakly supervised, answers only); MemNN (Weston et al., 2014), AM MemNN, AM+NG MemNN, AM+NL MemNN, AM+ML MemNN and AM+NG+NL MemNN (strong supervision, using supporting facts); number of training examples required by AM+NG+NL to reach ≥ 95% accuracy; AM+NG+NL accuracy under multitask training.

TASK | N-gram | LSTM | MemNN | AM | AM+NG | AM+NL | AM+ML | AM+NG+NL | Ex. required | Multitask
3.1 - Single Supporting Fact 36 50 100 100 100 100 100 100 250 ex. 100
3.2 - Two Supporting Facts 2 20 100 100 100 100 100 100 500 ex. 100
3.3 - Three Supporting Facts 7 20 20 100 99 100 99 100 500 ex. 98
3.4 - Two Arg. Relations 50 61 71 69 100 73 100 100 500 ex. 80
3.5 - Three Arg. Relations 20 70 83 83 86 86 98 98 1000 ex. 99
3.6 - Yes/No Questions 49 48 47 52 53 100 100 100 500 ex. 100
3.7 - Counting 52 49 68 78 86 83 90 85 FAIL 86
3.8 - Lists/Sets 40 45 77 90 88 94 91 91 FAIL 93
3.9 - Simple Negation 62 64 65 71 63 100 100 100 500 ex. 100
3.10 - Indefinite Knowledge 45 44 59 57 54 97 96 98 1000 ex. 98
3.11 - Basic Coreference 29 72 100 100 100 100 100 100 250 ex. 100
3.12 - Conjunction 9 74 100 100 100 100 100 100 250 ex. 100
3.13 - Compound Coreference 26 94 100 100 100 100 100 100 250 ex. 100
3.14 - Time Reasoning 19 27 99 100 99 100 99 99 500 ex. 99
3.15 - Basic Deduction 20 21 74 73 100 77 100 100 100 ex. 100
3.16 - Basic Induction 43 23 27 100 100 100 100 100 100 ex. 94
3.17 - Positional Reasoning 46 51 54 46 49 57 60 65 FAIL 72
3.18 - Size Reasoning 52 52 57 50 74 54 89 95 1000 ex. 93
3.19 - Path Finding 0 8 0 9 3 15 34 36 FAIL 19
3.20 - Agent’s Motivations 76 91 100 100 100 100 100 100 250 ex. 100
Mean Performance 34 49 75 79 83 87 93 93 100 92
A drawback of the bag-of-N-grams representation is that the dictionary grows rapidly with N. We therefore consider an alternative neural network approach, which we call a multilinear map. Each word in a sentence is binned into one of P_sz positions with p(i, l) = ⌈(i P_sz)/l⌉, where i is the position of the word in a sentence of length l, and for each position we employ an n × n matrix P_{p(i,l)}. We then model the matching score with:

    s(q, d) = E(q) · E(d);   E(x) = tanh( Σ_{i=1,...,l} P_{p(i,l)} Φ_x(x_i)^⊤ U )    (5)

whereby we apply a linear map for each word dependent on its position, followed by a tanh nonlinearity on the sum of mappings. Note that this is related to the model of (Yu et al., 2014), who consider tags rather than positions.

Finally, to assess the performance of nonlinear maps that do not model word position at all, we also consider the following nonlinear embedding:

    E(x) = tanh( W tanh( Φ_x(x)^⊤ U ) )    (6)

where W is an n × n matrix. This is similar to a classical two-layer neural network, but applied to both sides q and d of s(q, d). We also consider the straightforward combination of bag-of-N-grams followed by this nonlinearity.
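A sketch of the two embedding variants of Eqs. (5) and (6), with a one-hot word feature map Φ and randomly initialised parameters standing in for trained ones (the vocabulary size, embedding dimension and number of position bins below are illustrative):

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    V, n, P_sz = 150, 50, 4                      # vocabulary, embedding dim, position bins
    U = rng.normal(scale=0.1, size=(V, n))       # word embeddings: Phi_x(w)^T U selects a row
    P = rng.normal(scale=0.1, size=(P_sz, n, n)) # one n x n map per position bin (Eq. 5)
    W = rng.normal(scale=0.1, size=(n, n))       # second layer of the nonlinear map (Eq. 6)

    def bin_of(i, l):
        """p(i, l) = ceil(i * P_sz / l), shifted to 0-based indexing."""
        return math.ceil(i * P_sz / l) - 1

    def multilinear_embed(word_ids):
        """Eq. (5): position-dependent linear map per word, tanh of the sum."""
        l, total = len(word_ids), np.zeros(n)
        for i, w in enumerate(word_ids, start=1):
            total += P[bin_of(i, l)] @ U[w]
        return np.tanh(total)

    def nonlinear_embed(word_ids):
        """Eq. (6): bag-of-words embedding through a two-layer tanh map (no word order)."""
        bag = U[word_ids].sum(axis=0)
        return np.tanh(W @ np.tanh(bag))

    def score(q_ids, d_ids, embed):
        """s(q, d) = E(q) . E(d)."""
        return float(embed(q_ids) @ embed(d_ids))

    q, d = [3, 1, 4], [1, 5, 9, 2]               # toy word ids
    print(score(q, d, multilinear_embed), score(q, d, nonlinear_embed))

The matching score s(q, d) = E(q) · E(d) built from either embedding then plays the role of the plain bag-of-words match inside the Memory Network.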
6. Experiments

We compared the following methods on our set of tasks: (i) an N-gram baseline, (ii) LSTMs (long short-term memory recurrent neural networks) (Hochreiter & Schmidhuber, 1997), (iii) Memory Networks (MemNNs); and (iv) our extensions of Memory Networks described in Section 5.2. The N-gram baseline is inspired by the baselines in (Richardson et al., 2013) but applied to the case of producing a 1-word answer rather than answering a multiple choice question: we construct a bag-of-N-grams for all sentences in the story that share at least one word with the question, and then learn a linear classifier to predict the answer using those features³. LSTMs are a popular method for sequence prediction and outperform standard RNNs for similar tasks to ours in (Weston et al., 2014). Note that they are supervised by answers only, not supporting facts, and are hence at a disadvantage compared to MemNNs, which use them⁴.

³ Constructing N-grams from all sentences rather than using the filtered set gave worse results.
⁴ It is clearer to evaluate models in two tracks: fully and weakly supervised. Weak supervision is ultimately desirable; full supervision gives accuracy upper bounds for "weak" models.
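A sketch of that N-gram baseline under the stated construction (filter sentences sharing a word with the question, build a bag-of-N-grams, fit a linear classifier); the use of scikit-learn, trigrams and logistic regression as the linear classifier are our assumptions:

    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def words(s):
        return set(re.findall(r"[a-z]+", s.lower()))

    def story_features(story_sentences, question):
        """Keep only story sentences sharing at least one word with the
        question, then concatenate them with the question into one string."""
        kept = [s for s in story_sentences if words(s) & words(question)]
        return " ".join(kept + [question])

    # Bag-of-N-grams (here up to trigrams) feeding a linear classifier over
    # candidate single-word answers.
    ngram_baseline = make_pipeline(
        CountVectorizer(ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    # Training/testing (X is a list of story_features strings, y the answer words):
    # ngram_baseline.fit(X_train, y_train); ngram_baseline.predict(X_test)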
For each task we use 1000 questions for training, and 1000 for testing. Learning rates and other hyperparameters are chosen using the training set. For all MemNN variants we fixed the embedding dimension to n = 50 for simplicity; however, evaluation with larger n gave similar results.

The summary of our experimental results on the tasks is given in Table 1. We give results for each of the 20 tasks separately and the mean performance in the final row.

Standard MemNNs generally outperform the N-gram and LSTM baselines, which is consistent with the results in (Weston et al., 2014). However, they still "fail" at a number of tasks; that is, as the tasks have been built such that they are noise-free, we define failure to be test accuracy less than 95%⁵. Some of these failures are expected, as stated in Sec. 5.1: e.g., k = 2 facts, single word answers and bag-of-words do not succeed on tasks 3.3, 3.4, 3.5, 3.7, 3.8 and 3.18. However, there were also failures on tasks we did not at first expect, for example yes/no questions (3.6) and indefinite knowledge (3.10). With hindsight, we realize that the linear scoring function of standard MemNNs cannot model the match between query, supporting fact and a yes/no answer, as this requires three-way interactions.

⁵ The choice of 95% (and 1000 training examples) is arbitrary.

Columns 5-9 of Table 1 give the results for our MemNN extensions: adaptive memories and responses (AM) of Sec. 5.2.1, and the three sentence modeling approaches of Sec. 5.2.2: N-grams (NG), multilinear (ML) and nonlinear (NL), plus combinations thereof. The adaptive approach gives a straightforward improvement in tasks 3.3 and 3.16 because they both require more than two supporting facts, and also gives (small) improvements in 3.8 and 3.19 because they require multi-word outputs (but they still remain difficult). We hence use the AM model in combination with all our other extensions in the subsequent experiments.

MemNNs with N-gram modeling yield clear improvements when word order matters, e.g. tasks 3.4 and 3.15. However, N-grams do not seem to be a substitute for nonlinearities in the embedding function, as the NL model outperforms N-grams on average, especially on the yes/no (3.6) and indefinite knowledge (3.10) tasks, as explained before. On the other hand, the NL method cannot model word order and so fails e.g. on task 3.4. The obvious step is thus to combine these complementary approaches: indeed AM+NG+NL (column 9) gives improved results over both, with a total of 9 tasks that have been upgraded from failure to success compared to the original MemNN model. The multilinear model, as an alternative to this approach, also does similarly well and may be useful in real-world cases where N-grams cause the dictionary to be too large.

The final two columns (10-11) give further analysis of the AM+NG+NL MemNN method. The second to last column (10) shows the minimum number of training examples required to achieve ≥ 95% accuracy, or FAIL if this is not achieved with 1000 examples. This is important as it is not only desirable to perform well on a task, but also to do so using the
fewest number of examples (to generalize well, quickly). Most tasks require 100-500 examples. Task 3.8 requires 5000 examples and 3.7 requires 10000, hence they are labeled as FAIL. The latter task can presumably be solved by adding all the times an object is picked up and subtracting the times it is dropped, which seems possible for a MemNN, but it does not do this perfectly. Two tasks, positional reasoning 3.17 and path finding 3.19, cannot be solved even with 10000 examples; it seems those (and indeed more advanced forms of induction and deduction, which we plan to build) require a general search algorithm to be built into the inference procedure, which MemNNs lack.

The last column shows the performance of AM+NG+NL MemNNs when training on all the tasks jointly, rather than just on a single one. The performance is generally encouragingly similar, showing such a model can learn many aspects of text understanding and reasoning simultaneously.

6.1. Baseline using External Resources

We also built a classical cascade NLP system baseline using a structured SVM, which incorporates coreference resolution and semantic role labeling preprocessing steps, which are themselves trained on large amounts of costly labeled data. We first run the Stanford coreference system (Raghunathan et al., 2010) on the stories, and each mention is then replaced with the first mention of its entity class. Second, the SENNA semantic role labeling system (SRL) (Collobert et al., 2011) is run, and we collect the set of arguments for each verb. We then define a ranking task for finding the supporting facts (trained using strong supervision):

    o_1, o_2, o_3 = argmax_{o ∈ O} S_O(x, f_{o_1}, f_{o_2}, f_{o_3}; Θ)

where given the question x we find at most three supporting facts with indices o_i from the set of facts f in the story (we also consider selecting an "empty fact" for the case of fewer than three), and S_O is a linear scoring function with parameters Θ. Computing the argmax requires doing exhaustive search, unlike e.g. the MemNN method, which is greedy. For scalability, we thus prune the set of possible matches by requiring that facts share one common non-determiner word with each other match or with x. S_O is constructed as a set of indicator features. For simplicity each of the features only looks at pairs of sentences, i.e. S_O(x, f_{o_1}, f_{o_2}, f_{o_3}; Θ) = Θ · (g(x, f_{o_1}), g(x, f_{o_2}), g(x, f_{o_3}), g(f_{o_1}, f_{o_2}), g(f_{o_2}, f_{o_3}), g(f_{o_1}, f_{o_3})). The feature function g is made up of the following feature types, shown here for g(f_{o_1}, f_{o_2}): (1) Word pairs: one indicator variable for each pair of words in f_{o_1} and f_{o_2}. (2) Pair distance: an indicator for the distance between the sentences, i.e. o_1 − o_2. (3) Pair order: an indicator for the order of the sentences, i.e. o_1 > o_2. (4) SRL Verb Pair: indicator variables for each pair of SRL verbs in f_{o_1} and f_{o_2}. (5) SRL Verb-Arg Pair: indicator variables for each pair of SRL arguments in f_{o_1}, f_{o_2} and their corresponding verbs. After finding the supporting facts, we build a similar structured SVM for the response stage, also with features tuned for that goal: Words – an indicator for each word in x; Word Pairs – an indicator for each pair of words in x and the supporting facts; and similar SRL Verb and SRL Verb-Arg Pair features as before.
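The following sketch illustrates the pairwise feature function g and the pruned exhaustive argmax over triples of facts; it covers only the indicator features (1)-(3), and the SRL-based features, the "empty fact" option and the learned weights Θ are left out, with the pruning shown being a simplification of the one described above:

    from itertools import combinations

    def tokens(s):
        return s.lower().rstrip(".?").split()

    def g(a, b, ia=None, ib=None):
        """Pairwise indicator features: (1) word pairs and, when both items are
        facts with story indices, (2) their distance and (3) their order."""
        feats = {}
        for wa in tokens(a):
            for wb in tokens(b):
                feats[f"wordpair:{wa}|{wb}"] = 1.0
        if ia is not None and ib is not None:
            feats[f"distance:{ia - ib}"] = 1.0
            feats[f"order:{int(ia > ib)}"] = 1.0
        return feats

    def S_O(theta, x, facts, triple):
        """Linear score of a candidate triple: sum of all pairwise feature blocks."""
        feats = {}
        for i in triple:
            feats.update(g(x, facts[i]))
        for i, j in combinations(triple, 2):
            feats.update(g(facts[i], facts[j], i, j))
        return sum(theta.get(k, 0.0) for k in feats)

    def find_supporting_facts(theta, x, facts, k=3):
        """Exhaustive argmax over triples, crudely pruned to facts sharing a
        word with the question."""
        cand = [i for i, f in enumerate(facts) if set(tokens(x)) & set(tokens(f))]
        return max(combinations(cand, k),
                   key=lambda t: S_O(theta, x, facts, t), default=None)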
Results are given in Table 2. The structured SVM, despite having access to external resources, does not perform better than MemNNs overall, still failing at 9 tasks. It does perform well on tasks 3.6, 3.9 and 3.10, where the hand-built feature conjunctions capture the necessary nonlinearities that the original MemNNs do not. However, it seems to do significantly worse on tasks requiring three (and sometimes two) supporting facts (e.g. tasks 3.3, 3.16 and 3.2), presumably because ranking over so many possibilities introduces more mistakes. However, its non-greedy search does seem to help on other tasks, such as path finding (task 3.19), where search is very important.

7. Conclusion

We developed a set of tasks that we believe are a prerequisite to full language understanding and reasoning, and which include both training and testing data. While any learner that can solve these tasks is not necessarily close to solving AI, we believe that if a learner fails on any of our tasks, this exposes that it is definitely not going to solve AI.

We also presented some models that attempt to solve these tasks. Overall, our experiments give further proof that Memory Networks are an interesting model beyond the original paper. However, we also highlighted many flaws in that model, which our proposed extensions ameliorate to a degree. The main issues are that the models still fail on several of the tasks, and use a far stronger form of supervision (using supporting facts) than is typically realistic.

We hope that future research will aim to minimize the amount of required supervision, as well as the number of training examples that has to be seen to solve a new task. For example, it seems that humans are able to generalize to new tasks after seeing only a couple of dozen examples, without any additional supervision signal. Further, our hope is that a feedback loop of developing more challenging tasks, and then algorithms that can solve them, leads us in a fruitful research direction.
Table 2. Test accuracy (%) on our 20 Tasks for the baseline of Section 6.1 that uses external resources, comparing to various methods
from Table 1.
Columns, left to right: N-gram Classifier and LSTM (weakly supervised); MemNN (Weston et al., 2014) and AM+NG+NL MemNN (strong supervision, using supporting facts); Structured SVM with coreference and SRL features (strong supervision, uses external resources).

TASK | N-gram | LSTM | MemNN | AM+NG+NL MemNN | Structured SVM
3.1 - Single Supporting Fact 36 50 100 100 99
3.2 - Two Supporting Facts 2 20 100 100 74
3.3 - Three Supporting Facts 7 20 20 100 17
3.4 - Two Arg. Relations 50 61 71 100 98
3.5 - Three Arg. Relations 20 70 83 98 83
3.6 - Yes/No Questions 49 48 47 100 99
3.7 - Counting 52 49 68 85 69
3.8 - Lists/Sets 40 45 77 91 70
3.9 - Simple Negation 62 64 65 100 100
3.10 - Indefinite Knowledge 45 44 59 98 99
3.11 - Basic Coreference 29 72 100 100 100
3.12 - Conjunction 9 74 100 100 96
3.13 - Compound Coreference 26 94 100 100 99
3.14 - Time Reasoning 19 27 99 99 99
3.15 - Basic Deduction 20 21 74 100 96
3.16 - Basic Induction 43 23 27 100 24
3.17 - Positional Reasoning 46 51 54 65 61
3.18 - Size Reasoning 52 52 57 95 62
3.19 - Path Finding 0 8 0 36 49
3.20 - Agent’s Motivations 76 91 100 100 95
Mean Performance 34 49 75 93 79
References

Bache, K. and Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Berant, Jonathan, Chou, Andrew, Frostig, Roy, and Liang, Percy. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pp. 1533–1544, 2013.

Berant, Jonathan, Srikumar, Vivek, Chen, Pei-Chun, Huang, Brad, Manning, Christopher D., Vander Linden, Abby, Harding, Brittany, and Clark, Peter. Modeling biological processes for reading comprehension. In Proc. EMNLP, 2014.

Bordes, Antoine, Usunier, Nicolas, Collobert, Ronan, and Weston, Jason. Towards understanding situated natural language. In AISTATS, 2010.

Chen, David L. and Mooney, Raymond J. Learning to interpret natural language navigation instructions from observations. In Proc. AAAI, San Francisco, CA, pp. 859–865, 2011.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.

Fader, Anthony, Zettlemoyer, Luke, and Etzioni, Oren. Paraphrase-driven learning for open question answering. In ACL, pp. 1608–1618, 2013.

Fader, Anthony, Zettlemoyer, Luke, and Etzioni, Oren. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1156–1165. ACM, 2014.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Haghighi, Aria and Klein, Dan. Simple coreference resolution with rich syntactic and semantic features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 1152–1161. Association for Computational Linguistics, 2009.

Halevy, Alon, Norvig, Peter, and Pereira, Fernando. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Levesque, Hector J., Davis, Ernest, and Morgenstern, Leora. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011.

Liang, Percy. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408, 2013.

Liang, Percy, Jordan, Michael I., and Klein, Dan. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446, 2013.

Minsky, Marvin and Papert, Seymour. Perceptrons: An Introduction to Computational Geometry. The MIT Press, Cambridge, 1969 (expanded edition, 1988).

Yao, Xuchen, Berant, Jonathan, and Van Durme, Benjamin. Freebase QA: Information extraction or semantic parsing? ACL 2014, pp. 82, 2014.

Yu, Mo, Gormley, Matthew R., and Dredze, Mark. Factor-based compositional embedding models. NIPS 2014 Workshop on Learning Semantics, 2014.

Zhu, Xiaojin, Ghahramani, Zoubin, Lafferty, John, et al. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pp. 912–919, 2003.