ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding

Wang, Bingning; Lv, Feiyang; Yao, Ting; Yuan, Yiming; Ma, Jin; Luo, Yu; Liang, Haijin

arXiv:2208.03030 (cs)

[Submitted on 5 Aug 2022]

Abstract: Visual question answering is an important task in both natural language and vision understanding. However, in most public visual question answering datasets, such as VQA and CLEVR, the questions are human-generated and specific to the given image, such as "What color are her eyes?". These crowdsourced questions are relatively simple and are sometimes biased toward certain entities or attributes. In this paper, we introduce ChiQA, a new image-based question answering dataset. It contains real-world queries issued by internet users, each paired with several related open-domain images. The system must determine whether an image can answer the question. Unlike previous VQA datasets, the questions are real-world, image-independent queries that are more diverse and less biased. Compared with previous image-retrieval or image-captioning datasets, ChiQA measures not only relatedness but also answerability, which demands more fine-grained vision and language reasoning. ChiQA contains more than 40K questions and more than 200K question-image pairs. Each pair is assigned a three-level label (2/1/0) indicating a perfect answer, a partial answer, or an irrelevant image. Data analysis shows that ChiQA requires a deep understanding of both language and vision, including grounding, comparison, and reading. We evaluate several state-of-the-art vision-language models, such as ALBEF, and show that there is still substantial room for improvement on ChiQA.
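The three-level labeling described in the abstract can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class names, field names, and toy records below are hypothetical and are not taken from the released dataset, whose actual schema is not given on this page.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class Answerability(IntEnum):
    """The 2/1/0 label from the abstract; member names are illustrative."""
    IRRELEVANT = 0  # image does not relate to or answer the question
    PARTIAL = 1     # image partially answers the question
    PERFECT = 2     # image fully answers the question


@dataclass(frozen=True)
class QuestionImagePair:
    # Hypothetical field names, not the dataset's real schema.
    question: str
    image_id: str
    label: Answerability


def label_counts(pairs):
    """Count how many pairs carry each of the three labels."""
    counts = Counter(pair.label for pair in pairs)
    return {level: counts.get(level, 0) for level in Answerability}


# Toy records invented for illustration: one query, several candidate images.
toy_pairs = [
    QuestionImagePair("how to tie a bowline knot", "img_001", Answerability.PERFECT),
    QuestionImagePair("how to tie a bowline knot", "img_002", Answerability.PARTIAL),
    QuestionImagePair("how to tie a bowline knot", "img_003", Answerability.IRRELEVANT),
    QuestionImagePair("how to tie a bowline knot", "img_004", Answerability.IRRELEVANT),
]

if __name__ == "__main__":
    for level, count in label_counts(toy_pairs).items():
        print(level.name, count)
```

Because one question is paired with several images, a record is keyed by the (question, image) pair rather than by the question alone, which matches the abstract's counts of more than 40K questions and more than 200K pairs.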

Comments:	CIKM 2022 camera-ready version
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2208.03030 [cs.CL]
	(or arXiv:2208.03030v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2208.03030

Submission history

From: Dr. Bingning Wang
[v1] Fri, 5 Aug 2022 07:55:28 UTC (7,211 KB)
