domain have either focused on specific document elements such as data visualisations [19, 21] or on specific collections such as book covers [28]. In contrast to such approaches, we recast the problem in its generic form and put forward a large scale, varied collection of real documents.

The main contributions of this work can be summarized as follows:

• We introduce DocVQA, a large scale dataset of 12,767 document images of varied types and content, over which we have defined 50,000 questions and answers. The questions are categorised based on their reasoning requirements, allowing us to analyze how DocVQA methods fare for different question types.

• We define and evaluate various baseline methods over the DocVQA dataset, ranging from simple heuristic methods and human performance analysis, which allow us to define upper performance bounds under different assumptions, to state of the art scene text VQA models and NLP models.
2. Related Datasets and Tasks

Machine reading comprehension (MRC) and open-domain question answering (QA) are two problems being actively pursued by the Natural Language Processing (NLP) and Information Retrieval (IR) communities. In MRC the task is to answer a natural language question given a paragraph (or a single document) as the context. In open-domain QA, no specific context is given and the answer needs to be found in a large collection (say, Wikipedia) or on the Web. MRC is often modelled as an extractive QA problem, where the answer is defined as a span of the context on which the question is posed. Examples of datasets for extractive QA include SQuAD 1.1 [32], NewsQA [37] and Natural Questions [27]. MS MARCO [29] is an example of a dataset for abstractive QA, where answers need to be generated, not extracted. Recently, Transformer based pretraining methods like Bidirectional Encoder Representations from Transformers (BERT) [9] and XLNet [41] have helped to build QA models that outperform humans on reading comprehension on SQuAD [32]. In contrast to QA in NLP, where the context is given as computer readable strings, the contexts in DocVQA are document images.
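To make the extractive formulation concrete, the sketch below runs an off-the-shelf SQuAD-style extractive QA model on a plain-text context using the HuggingFace transformers pipeline; the model name and the example context are illustrative choices, not baselines from this paper.

```python
# Minimal sketch of extractive QA: the answer is returned as a span
# (character offsets) of the given context, not generated freely.
# The model id below is an illustrative off-the-shelf choice.
from transformers import pipeline

qa_model = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

context = (
    "The meeting of the nutrition committee was held on "
    "March 17, 1988 at the Washington office."
)
question = "When was the meeting of the nutrition committee held?"

prediction = qa_model(question=question, context=context)
# The prediction contains the answer text, its start/end character
# offsets in the context, and a confidence score.
print(prediction["answer"], prediction["start"], prediction["end"])
```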
Visual Question Answering (VQA) aims to provide an accurate natural language answer given an image and a natural language question. VQA has attracted an intense research effort over the past few years [13, 1, 17]. Out of the large body of work on VQA, the scene text VQA branch is the most related to our work. Scene text VQA refers to VQA systems aiming to deal with cases where understanding the scene text instances in an image is necessary to respond to the questions posed. The ST-VQA [5] and TextVQA [35] datasets were introduced in parallel in 2019 and were quickly followed by more research [36, 11, 39].

The ST-VQA dataset [5] has 31,000+ questions over 23,000+ images collected from different public data sets. The TextVQA dataset [35] has 45,000+ questions over 28,000+ images sampled from specific categories of the OpenImages dataset [25] that are expected to contain text. Another dataset named OCR-VQA [28] comprises more than 1 million question-answer pairs over 207K+ images of book covers. The questions in this dataset are domain specific, generated from question templates and answers extracted from the available metadata.

Scene text VQA methods [16, 11, 35, 12] typically make use of pointer mechanisms in order to deal with out-of-vocabulary (OOV) words appearing in the image and to provide the open answer space required. This goes hand in hand with the use of word embeddings capable of encoding OOV words into a pre-defined semantic space, such as FastText [6] or BERT [9]. More recent, top-performing methods in this space include the M4C [16] and MM-GNN [11] models.
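As a rough illustration of such a pointer mechanism (a generic sketch, not the exact scoring used by M4C or MM-GNN), a decoder can score a fixed answer vocabulary and the OCR tokens of the current image jointly, and copy an OCR token whenever it receives the highest score:

```python
import torch

def pointer_decode_step(vocab_scores: torch.Tensor,
                        ocr_scores: torch.Tensor,
                        vocab: list[str],
                        ocr_tokens: list[str]) -> str:
    """Pick the next answer word either from a fixed vocabulary or,
    via the pointer, by copying one of the OCR tokens in the image.

    vocab_scores: (len(vocab),) logits over the fixed vocabulary.
    ocr_scores:   (len(ocr_tokens),) logits over this image's OCR tokens.
    """
    # Concatenate the two score vectors so vocabulary words and OCR
    # tokens compete in a single softmax (a dynamic answer space).
    joint = torch.cat([vocab_scores, ocr_scores])
    idx = int(torch.argmax(torch.softmax(joint, dim=0)))
    if idx < len(vocab):
        return vocab[idx]                    # ordinary vocabulary word
    return ocr_tokens[idx - len(vocab)]      # copied (pointed-to) OCR token

# Toy example: the OOV token "Rothmans" can be produced even though it
# is not in the fixed vocabulary, because the pointer copies it.
vocab = ["yes", "no", "coca", "cola"]
ocr_tokens = ["Rothmans", "1988"]
print(pointer_decode_step(torch.tensor([0.1, 0.2, 0.0, 0.1]),
                          torch.tensor([2.5, 0.3]), vocab, ocr_tokens))
```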
In parallel, there have been works on domain specific VQA tasks which require reading and understanding the text in the images. The DVQA dataset presented by Kafle et al. [20, 19] comprises synthetically generated images of bar charts and template questions defined automatically based on the bar chart metadata. The dataset contains more than three million question-answer pairs over 300,000 images. FigureQA [21] comprises over one million yes or no questions grounded on over 100,000 images. Three different types of charts are used: bar, pie and line charts. Similar to DVQA, the images are synthetically generated and the questions are generated from templates. Another related QA task is Textbook Question Answering (TQA) [23], where multiple choice questions are asked on a multimodal context including text, diagrams and images. Here the textual information is provided in computer readable format.

Compared to these existing datasets, which concern either VQA on real world images or domain specific VQA for charts or book covers, the proposed DocVQA comprises document images. The dataset covers a multitude of different document types that include elements like tables, forms and figures, as well as a range of different textual, graphical and structural elements.

3. DocVQA

In this section we explain the data collection and annotation process and present statistics and analysis of DocVQA.

3.1. Data Collection

Document Images: Images in the dataset are sourced from documents in the UCSF Industry Documents Library¹.

¹ https://www.industrydocuments.ucsf.edu/
Figure 2: Document images we use in the dataset come from 6,071 documents spanning many decades, of a variety of types, originating from 5 different industries. We use documents from the UCSF Industry Documents Library. (a) Industry-wise distribution of the documents. (b) Year-wise distribution of the documents. (c) Various types of documents used.
The documents are organized under different industries and further under different collections. We downloaded documents from different collections and hand picked pages from these documents for use in the dataset. The majority of documents in the library are binarized, and the binarization has taken a toll on the image quality. We tried to minimize binarized images in DocVQA since we did not want poor image quality to be a bottleneck for VQA. We also prioritized pages with tables, forms, lists and figures over pages which only have running text.

The final set of images in the dataset is drawn from pages of 6,071 industry documents. We made use of documents from as early as 1900 to as recent as 2018 (Figure 2b). Most of the documents are from the 1960-2000 period and they include typewritten, printed, handwritten and born-digital text. There are documents from all 5 major industries for which the library hosts documents: tobacco, food, drug, fossil fuel and chemical. We use many documents from food and nutrition related collections, as they have a good number of non-binarized images. See Figure 2a for the industry-wise distribution of the 6,071 documents used. The documents comprise a wide variety of document types, as shown in Figure 2c.

Questions and Answers: Questions and answers on the selected document images are collected with the help of remote workers, using a Web based annotation tool.

The annotation process was organized in three stages. In stage 1, workers were shown a document image and asked to define at most 10 question-answer pairs on it. We encouraged the workers to add more than one ground truth answer per question in cases where this is warranted.

Workers were instructed to ask questions which can be answered using text present in the image and to enter the answer verbatim from the document. This makes VQA on the DocVQA dataset an extractive QA problem, similar to extractive QA tasks in NLP [32, 37] and to VQA in the case of ST-VQA [5]. The second annotation stage aims to verify the data collected in the first stage. Here a worker was shown an image and the questions defined on it in the first stage (but not the answers from the first stage), and was required to enter answers for the questions. In this stage workers were also required to assign one or more question types to each question. The different question types in DocVQA are discussed in subsection 3.2. During the second stage, if the worker finds a question inapt owing to language issues or ambiguity, an option to flag the question was provided. Such questions are not included in the dataset.

If none of the answers entered in the first stage matches exactly with any of the answers from the second stage, the particular question is sent for review in a third stage. Here questions and answers are editable and the reviewer either accepts the question-answer pair (after editing if necessary) or ignores it. The third stage review is done by the authors themselves.
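The routing rule between the second and third stages is a plain exact-match check; a minimal sketch is given below, where the function and field names are ours and not taken from the authors' annotation tool.

```python
def needs_third_stage_review(stage1_answers: list[str],
                             stage2_answers: list[str]) -> bool:
    """Return True when none of the stage-1 answers exactly matches
    any stage-2 answer, i.e. the question goes to manual review.

    Matching here is verbatim string equality; any normalisation
    policy beyond that is an illustrative assumption.
    """
    return not any(a1 == a2
                   for a1 in stage1_answers
                   for a2 in stage2_answers)

# Example: differing punctuation counts as a mismatch and triggers review.
print(needs_third_stage_review(["March 17, 1988"], ["March 17 1988"]))  # True
print(needs_third_stage_review(["$1,200"], ["$1,200", "1200"]))         # False
```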
Figure 3: The 9 question types and the share of questions in each type.

3.2. Statistics and Analysis

DocVQA comprises 50,000 questions framed on 12,767 images. The data is split randomly in an 80-10-10 ratio into train, validation and test splits. The train split has 39,463 questions and 10,194 images, the validation split has 5,349 questions and 1,286 images, and the test split has 5,188 questions and 1,287 images.
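A minimal sketch of such a random 80-10-10 split, grouping questions by image so that all questions on a page fall into the same split (which is consistent with the disjoint per-split image counts above), could look as follows; the file name and field names are illustrative and do not follow the released annotation files.

```python
import json
import random

# Hypothetical input: one record per question, each carrying the id of
# the document image it was asked on.
records = json.load(open("docvqa_questions.json"))

image_ids = sorted({r["image_id"] for r in records})
random.seed(0)
random.shuffle(image_ids)

# 80-10-10 split at the image level, so every question about a given
# page ends up in exactly one split.
n = len(image_ids)
train_ids = set(image_ids[: int(0.8 * n)])
val_ids = set(image_ids[int(0.8 * n): int(0.9 * n)])

splits = {"train": [], "val": [], "test": []}
for r in records:
    if r["image_id"] in train_ids:
        splits["train"].append(r)
    elif r["image_id"] in val_ids:
        splits["val"].append(r)
    else:
        splits["test"].append(r)

for name, qs in splits.items():
    print(name, len(qs), "questions,", len({r["image_id"] for r in qs}), "images")
```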
Figure 4: Question, answer and OCR token statistics compared to similar VQA datasets (VQA 2.0 [13], ST-VQA [5] and TextVQA [35]) and the SQuAD 1.1 [32] reading comprehension dataset. (a) Top 15 most frequent questions. (b) Top 15 most frequent answers. (c) Top 15 non-numeric answers. (d) Questions with a particular length. (e) Answers with a particular length. (f) Images/contexts with a particular length.
As mentioned before, questions are tagged with question type(s) during the second stage of the annotation process. Figure 3 shows the 9 question types and the percentage of questions under each type. A question type signifies the type of data in which the question is grounded. For example, 'table/list' is assigned if answering the question requires understanding of a table or a list. If the information is in the form of a key:value pair, the 'form' type is assigned. 'Layout' is assigned to questions which require spatial/layout information to find the answer; for example, questions asking for a title or heading require one to understand the structure of the document. If the answer to a question is based on information in the form of sentences/paragraphs, the type assigned is 'running text'. For all questions where the answer is based on handwritten text, the 'handwritten' type is assigned. Note that a question can have more than one type associated with it. (Examples from DocVQA for each question type are given in the supplementary material.)

In the following analysis we compare statistics of questions, answers and OCR tokens with similar VQA datasets (VQA 2.0 [13], TextVQA [35] and ST-VQA [5]) and with the SQuAD 1.1 [32] dataset for reading comprehension. Statistics for the other datasets are computed based on their publicly available data splits. For statistics on OCR tokens, for DocVQA we use OCR tokens generated by a commercial OCR solution; for VQA 2.0, TextVQA and ST-VQA we use the OCR tokens made available by the authors of LoRRA [35] and M4C [16] as part of the MMF [34] framework.

Figure 4d shows the distribution of question lengths for questions in DocVQA compared to the other datasets. The average question length is 8.12, which is the second highest among the compared datasets. In DocVQA, 35,362 (70.72%) questions are unique. Figure 4a shows the top 15 most frequent questions and their frequencies. There are questions repeatedly being asked about dates, titles and page numbers. A sunburst of the first 4 words of the questions is shown in Figure 6. It can be seen that a large majority of questions start with "what is the", asking for a date, title, total, amount or name.

The distribution of answer lengths is shown in Figure 4e. We observe in the figure that both DocVQA and SQuAD 1.1 have a higher number of longer answers compared to the VQA datasets. The average answer length is 2.17, and 63.2% of the answers are unique, which is second only to SQuAD 1.1 (72.5%). The top 15 answers in the dataset are shown in Figure 4b. We observe that almost all of the top answers are numeric values, which is expected since there are a good number of document images of reports and invoices. In Figure 4c we show the top 15 non-numeric answers. These in-

Figure 5: Word clouds of words in answers (left) and words spotted on the document images in the dataset (right).
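The aggregate statistics reported above (average question and answer lengths, share of unique questions and answers) can be recomputed directly from the question and answer lists; a minimal sketch, assuming whitespace tokenisation and made-up example records, is shown below.

```python
from statistics import mean

def length_and_uniqueness(texts: list[str]) -> tuple[float, float]:
    """Average length in whitespace-separated tokens and the share of
    unique strings (exact, case-sensitive match). The tokenisation
    choice is an assumption, not the authors' exact procedure."""
    avg_len = mean(len(t.split()) for t in texts)
    unique_share = 100.0 * len(set(texts)) / len(texts)
    return avg_len, unique_share

# Toy usage with made-up records; on the real data one would pass the
# full lists of questions and ground-truth answers.
questions = ["What is the date?", "What is the title of the document?"]
answers = ["17-03-1988", "Annual Report"]
print(length_and_uniqueness(questions))
print(length_and_uniqueness(answers))
```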
is selected as the answer. (iv) Majority answer measures the performance when the most frequent answer in the train split is considered as the answer.

We also compute the following upper bounds: (i) Vocab UB: the upper bound on performance one can get by predicting correct answers for the questions, provided the correct answer is present in a vocabulary of answers comprising all answers which occur more than once in the train split. (ii) OCR substring UB: the upper bound on predicting the correct answer, provided the answer can be found as a substring of the sequence of OCR tokens. The sequence is made by serializing the OCR tokens recognized in the document, separated by spaces, in top-left to bottom-right order. (iii) OCR subsequence UB: the upper bound on predicting the correct answer, provided the answer is a subsequence of the OCR tokens' sequence.
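The two OCR-based upper bounds reduce to simple string and sequence checks; a minimal sketch is given below, assuming the top-left to bottom-right serialization is applied upstream and using case-insensitive matching as an illustrative choice.

```python
def ocr_substring_hit(answer: str, ocr_tokens: list[str]) -> bool:
    """Answer is reachable if it occurs as a substring of the
    space-separated serialization of the OCR tokens."""
    serialized = " ".join(ocr_tokens)
    return answer.lower() in serialized.lower()

def ocr_subsequence_hit(answer: str, ocr_tokens: list[str]) -> bool:
    """Answer is reachable if its words appear in the OCR token
    sequence in order, though not necessarily contiguously."""
    answer_words = answer.lower().split()
    remaining = iter(t.lower() for t in ocr_tokens)
    return all(word in remaining for word in answer_words)

def vocab_hit(answer: str, train_answer_counts: dict[str, int]) -> bool:
    """Vocab UB: the answer must occur more than once in the train split."""
    return train_answer_counts.get(answer, 0) > 1

# Example: "total revenue" is a subsequence but not a substring of the
# serialized tokens below.
tokens = ["Total", "net", "revenue", "1988"]
print(ocr_substring_hit("total revenue", tokens))     # False
print(ocr_subsequence_hit("total revenue", tokens))   # True
```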