Quasar: Datasets for Question Answering by Search and Reading

Dhingra, Bhuwan; Mazaitis, Kathryn; Cohen, William W.

arXiv:1707.03904 (cs)

[Submitted on 12 Jul 2017 (v1), last revised 9 Aug 2017 (this version, v2)]

Abstract: We present two new large-scale datasets aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. The Quasar-S dataset consists of 37000 cloze-style (fill-in-the-gap) queries constructed from definitions of software entity tags on the popular website Stack Overflow. The posts and comments on the website serve as the background corpus for answering the cloze questions. The Quasar-T dataset consists of 43000 open-domain trivia questions and their answers obtained from various internet sources. ClueWeb09 serves as the background corpus for extracting these answers. We pose these datasets as a challenge for two related subtasks of factoid Question Answering: (1) searching for relevant pieces of text that include the correct answer to a query, and (2) reading the retrieved text to answer the query. We also describe a retrieval system for extracting relevant sentences and documents from the corpus given a query, and include these in the release for researchers wishing to only focus on (2). We evaluate several baselines on both datasets, ranging from simple heuristics to powerful neural models, and show that these lag behind human performance by 16.4% and 32.1% for Quasar-S and -T respectively. The datasets are available at this https URL.
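The two subtasks named in the abstract, (1) search and (2) reading, can be illustrated with a minimal sketch. This is not the authors' retrieval system or any of their baselines: the token-overlap ranking, the frequency-based reader, the function names `tokenize`, `search`, and `read`, and the toy query, corpus, and candidate list are all assumptions invented here for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, corpus, k=2):
    """Subtask (1): rank passages by token overlap with the query, keep the top k.

    Token overlap is a stand-in chosen for brevity, not the paper's retriever.
    """
    q = set(tokenize(query))
    ranked = sorted(corpus, key=lambda p: len(q & set(tokenize(p))), reverse=True)
    return ranked[:k]


def read(passages, candidates):
    """Subtask (2): return the candidate mentioned most often in the passages.

    Only single-token candidates are counted in this toy heuristic.
    """
    counts = Counter()
    for passage in passages:
        tokens = tokenize(passage)
        for cand in candidates:
            counts[cand] += tokens.count(cand.lower())
    return counts.most_common(1)[0][0]


# Toy data, invented for illustration (not taken from Quasar-S or Quasar-T).
query = "@placeholder is a general-purpose programming language designed by guido van rossum"
corpus = [
    "python was created by guido van rossum as a general purpose language",
    "i use python for scripting because the language is easy to read",
    "java runs on the jvm and is statically typed",
]
candidates = ["python", "java"]

retrieved = search(query, corpus, k=2)
answer = read(retrieved, candidates)
print(answer)
```

Keeping the two steps as separate functions mirrors the split described in the abstract: a reader can be evaluated on its own by passing it pre-retrieved passages in place of the output of `search`.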

Subjects:	Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:1707.03904 [cs.CL]
	(or arXiv:1707.03904v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1707.03904

Submission history

From: Bhuwan Dhingra
[v1] Wed, 12 Jul 2017 20:53:26 UTC (630 KB)
[v2] Wed, 9 Aug 2017 01:48:08 UTC (630 KB)
