
DocVQA: A Dataset for VQA on Document Images

Minesh Mathew¹  Dimosthenis Karatzas²  C.V. Jawahar¹

¹ CVIT, IIIT Hyderabad, India
² Computer Vision Center, UAB, Spain

[email protected], [email protected], [email protected]

Abstract

We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding the structure of the document is crucial. The dataset, code and leaderboard are available at docvqa.org.

Figure 1: Example question-answer pairs from DocVQA. Answering questions in the new dataset requires models not just to read text but to interpret it within the layout/structure of the document.
Q: Mention the ZIP code written? A: 80202
Q: What date is seen on the seal at the top of the letter? A: 23 sep 1970
Q: Which company address is mentioned on the letter? A: Great western sugar Co.

1. Introduction

Research in Document Analysis and Recognition (DAR) is generally focused on information extraction tasks that aim to convert information in document images into machine readable form, such as character recognition [10], table extraction [22] or key-value pair extraction [30]. Such algorithms tend to be designed as task specific blocks, blind to the end-purpose the extracted information will be used for.

Progressing independently in such information extraction processes has been quite successful, although it is not necessarily true that holistic document image understanding can be achieved through a simple constructionist approach, building upon such modules. The scale and complexity of the task introduce difficulties that require a different point of view.

In this article we introduce Document Visual Question Answering (DocVQA), as a high-level task dynamically driving DAR algorithms to conditionally interpret document images. By doing so, we seek to inspire a “purpose-driven” point of view in DAR research. In the case of Document VQA, as illustrated in Figure 1, an intelligent reading system is expected to respond to ad-hoc requests for information, expressed as natural language questions by human users. To do so, reading systems should not only extract and interpret the textual (handwritten, typewritten or printed) content of the document images, but exploit numerous other visual cues including layout (page structure, forms, tables), non-textual elements (marks, tick boxes, separators, diagrams) and style (font, colours, highlighting), to mention just a few.

Departing from generic VQA [13] and Scene Text VQA [35, 5] approaches, document images warrant a different approach that exploits all the above visual cues, makes use of prior knowledge of the implicit written communication conventions used, and deals with the high-density semantic information conveyed in such images. Answers in the case of document VQA cannot be sourced from a closed dictionary; they are inherently open ended.

Previous approaches on bringing VQA to the documents domain have either focused on specific document elements such as data visualisations [19, 21] or on specific collections such as book covers [28]. In contrast to such approaches, we recast the problem to its generic form, and put forward a large scale, varied collection of real documents.

The main contributions of this work can be summarized as follows:

• We introduce DocVQA, a large scale dataset of 12,767 document images of varied types and content, over which we have defined 50,000 questions and answers. The questions defined are categorised based on their reasoning requirements, allowing us to analyze how DocVQA methods fare for different question types.

• We define and evaluate various baseline methods over the DocVQA dataset, ranging from simple heuristic methods and human performance analysis that allow us to define upper performance bounds given different assumptions, to state of the art Scene Text VQA models and NLP models.

2. Related Datasets and Tasks

Machine reading comprehension (MRC) and open-domain question answering (QA) are two problems which are being actively pursued by the Natural Language Processing (NLP) and Information Retrieval (IR) communities. In MRC the task is to answer a natural language question given a paragraph (or a single document) as the context. In the case of open domain QA, no specific context is given and the answer needs to be found from a large collection (say Wikipedia) or from the Web. MRC is often modelled as an extractive QA problem where the answer is defined as a span of the context on which the question is defined. Examples of datasets for extractive QA include SQuAD 1.1 [32], NewsQA [37] and Natural Questions [27]. MS MARCO [29] is an example of a QA dataset for abstractive QA, where answers need to be generated, not extracted. Recently, Transformer based pretraining methods like Bidirectional Encoder Representations from Transformers (BERT) [9] and XLNet [41] have helped to build QA models outperforming humans on reading comprehension on SQuAD [32]. In contrast to QA in NLP, where the context is given as computer readable strings, contexts in the case of DocVQA are document images.

Visual Question Answering (VQA) aims to provide an accurate natural language answer given an image and a natural language question. VQA has attracted an intense research effort over the past few years [13, 1, 17]. Out of a large body of work on VQA, the scene text VQA branch is the most related to our work. Scene text VQA refers to VQA systems aiming to deal with cases where understanding scene text instances is necessary to respond to the questions posed. The ST-VQA [5] and TextVQA [35] datasets were introduced in parallel in 2019 and were quickly followed by more research [36, 11, 39].

The ST-VQA dataset [5] has 31,000+ questions over 23,000+ images collected from different public data sets. The TextVQA dataset [35] has 45,000+ questions over 28,000+ images sampled from specific categories of the OpenImages dataset [25] that are expected to contain text. Another dataset named OCR-VQA [28] comprises more than 1 million question-answer pairs over 207K+ images of book covers. The questions in this dataset are domain specific, generated based on template questions and answers extracted from available metadata.

Scene text VQA methods [16, 11, 35, 12] typically make use of pointer mechanisms in order to deal with out-of-vocabulary (OOV) words appearing in the image and provide the open answer space required. This goes hand in hand with the use of word embeddings capable of encoding OOV words into a pre-defined semantic space, such as FastText [6] or BERT [9]. More recent, top-performing methods in this space include the M4C [16] and MM-GNN [11] models.

In parallel, there have been works on certain domain specific VQA tasks which require reading and understanding text in the images. The DVQA dataset presented by Kafle et al. [20, 19] comprises synthetically generated images of bar charts and template questions defined automatically based on the bar chart metadata. The dataset contains more than three million question-answer pairs over 300,000 images. FigureQA [21] comprises over one million yes or no questions, grounded on over 100,000 images. Three different types of charts are used: bar, pie and line charts. Similar to DVQA, images are synthetically generated and questions are generated from templates. Another related QA task is Textbook Question Answering (TQA) [23], where multiple choice questions are asked on a multimodal context, including text, diagrams and images. Here textual information is provided in computer readable format.

Compared to these existing datasets, which either concern VQA on real world images or domain specific VQA for charts or book covers, the proposed DocVQA comprises document images. The dataset covers a multitude of different document types that include elements like tables, forms and figures, as well as a range of different textual, graphical and structural elements.

3. DocVQA

In this section we explain the data collection and annotation process and present statistics and analysis of DocVQA.

3.1. Data Collection

Document Images: Images in the dataset are sourced from documents in the UCSF Industry Documents Library¹. The documents are organized under different industries and further under different collections.

¹ https://www.industrydocuments.ucsf.edu/

Figure 2: Document images we use in the dataset come from 6,071 documents spanning many decades, of a variety of types, originating from 5 different industries. We use documents from the UCSF Industry Documents Library. (a) Industry-wise distribution of the documents. (b) Year-wise distribution of the documents. (c) Various types of documents used.

We downloaded documents from different collections and hand picked pages from these documents for use in the dataset. The majority of documents in the library are binarized, and the binarization has taken a toll on the image quality. We tried to minimize binarized images in DocVQA since we did not want poor image quality to be a bottleneck for VQA. We also prioritized pages with tables, forms, lists and figures over pages which only have running text.

The final set of images in the dataset is drawn from pages of 6,071 industry documents. We made use of documents from as early as 1900 to as recent as 2018 (Figure 2b). Most of the documents are from the 1960-2000 period and they include typewritten, printed, handwritten and born-digital text. There are documents from all 5 major industries for which the library hosts documents — tobacco, food, drug, fossil fuel and chemical. We use many documents from food and nutrition related collections, as they have a good number of non-binarized images. See Figure 2a for the industry-wise distribution of the 6,071 documents used. The documents comprise a wide variety of document types, as shown in Figure 2c.

Questions and Answers: Questions and answers on the selected document images are collected with the help of remote workers, using a Web based annotation tool.

The annotation process was organized in three stages. In stage 1, workers were shown a document image and asked to define at most 10 question-answer pairs on it. We encouraged the workers to add more than one ground truth answer per question in cases where it is warranted.

Workers were instructed to ask questions which can be answered using text present in the image and to enter the answer verbatim from the document. This makes VQA on the DocVQA dataset an extractive QA problem, similar to extractive QA tasks in NLP [32, 37] and to VQA in the case of ST-VQA [5]. The second annotation stage aims to verify the data collected in the first stage. Here a worker was shown an image and the questions defined on it in the first stage (but not the answers from the first stage), and was required to enter answers for the questions. In this stage workers were also required to assign one or more question types to each question. The different question types in DocVQA are discussed in subsection 3.2. During the second stage, if the worker finds a question inapt owing to language issues or ambiguity, an option to flag the question was provided. Such questions are not included in the dataset.

If none of the answers entered in the first stage matches exactly any of the answers from the second stage, the particular question is sent for review in a third stage. Here questions and answers are editable and the reviewer either accepts the question-answer pair (after editing if necessary) or ignores it. The third stage review is done by the authors themselves.
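To make the second-stage check concrete, the following is a minimal sketch (not the authors' actual annotation tooling; the record fields are hypothetical) of the rule that routes a question to the third review stage when no stage-1 answer exactly matches a stage-2 answer.

```python
def needs_third_stage_review(stage1_answers, stage2_answers):
    """Return True when a question should go to the third (review) stage.

    As described above, a question is escalated if none of the answers
    entered in stage 1 matches any of the answers from stage 2 exactly.
    The comparison here is case-sensitive; the exact matching rule used
    in the real pipeline may differ.
    """
    return set(stage1_answers).isdisjoint(set(stage2_answers))


# Hypothetical example record from the annotation tool:
record = {
    "question": "What date is seen on the seal at the top of the letter?",
    "stage1_answers": ["23 sep 1970"],
    "stage2_answers": ["23 sep 1970"],
}
print(needs_third_stage_review(record["stage1_answers"],
                               record["stage2_answers"]))  # False: answers agree
```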
3.2. Statistics and Analysis

DocVQA comprises 50,000 questions framed on 12,767 images. The data is split randomly in an 80-10-10 ratio into train, validation and test splits. The train split has 39,463 questions and 10,194 images, the validation split has 5,349 questions and 1,286 images, and the test split has 5,188 questions and 1,287 images.
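Such split statistics can be recomputed directly from the released annotation files. Below is a minimal sketch under the assumption that each split is a JSON file with a top-level "data" list whose entries carry "question", "answers" and "image" fields; these field names and the file name are our assumptions about the public release, not taken from the paper.

```python
import json

def split_stats(annotation_file):
    """Compute basic statistics for one DocVQA split.

    Assumes a JSON file with a top-level "data" list whose entries carry
    "question", "answers" and "image" fields (an assumption about the
    release format; adjust to the actual schema).
    """
    with open(annotation_file) as f:
        entries = json.load(f)["data"]

    num_questions = len(entries)
    num_images = len({e["image"] for e in entries})
    avg_q_len = sum(len(e["question"].split()) for e in entries) / num_questions
    # Answers may be absent in the test split, hence the guard.
    avg_a_len = (
        sum(len(e["answers"][0].split()) for e in entries if e.get("answers"))
        / num_questions
    )
    return num_questions, num_images, avg_q_len, avg_a_len

# Example (hypothetical file name); for the train split the first two values
# should be roughly (39463, 10194), with averages close to the dataset-wide
# 8.12 words per question and 2.17 words per answer reported above.
# print(split_stats("train_v1.0.json"))
```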

Figure 4: Question, answer and OCR tokens’ statistics compared to similar datasets from VQA (VQA 2.0 [13], ST-VQA [5] and TextVQA [35]) and the SQuAD 1.1 [32] reading comprehension dataset. (a) Top 15 most frequent questions. (b) Top 15 most frequent answers. (c) Top 15 non-numeric answers. (d) Questions with a particular length. (e) Answers with a particular length. (f) Images/contexts with a particular length.

Figure 3: The 9 question types and the share of questions in each type.

As mentioned before, questions are tagged with question type(s) during the second stage of the annotation process. Figure 3 shows the 9 question types and the percentage of questions under each type. A question type signifies the type of data in which the question is grounded. For example, ‘table/list’ is assigned if answering the question requires understanding of a table or a list. If the information is in the form of a key:value pair, the ‘form’ type is assigned. ‘Layout’ is assigned for questions which require spatial/layout information to find the answer. For example, questions asking for a title or heading require one to understand the structure of the document. If the answer for a question is based on information in the form of sentences/paragraphs, the type assigned is ‘running text’. For all questions where the answer is based on handwritten text, the ‘handwritten’ type is assigned. Note that a question can have more than one type associated with it. (Examples from DocVQA for each question type are given in the supplementary.)

In the following analysis we compare statistics of questions, answers and OCR tokens with other similar datasets for VQA (VQA 2.0 [13], TextVQA [35] and ST-VQA [5]) and with the SQuAD 1.1 [32] dataset for reading comprehension. Statistics for the other datasets are computed based on their publicly available data splits. For statistics on OCR tokens, for DocVQA we use OCR tokens generated by a commercial OCR solution. For VQA 2.0, TextVQA and ST-VQA we use OCR tokens made available by the authors of LoRRA [35] and M4C [16] as part of the MMF [34] framework.

Figure 4d shows the distribution of question lengths for questions in DocVQA compared to other similar datasets. The average question length is 8.12, which is the second highest among the compared datasets. In DocVQA, 35,362 (70.72%) questions are unique. Figure 4a shows the top 15 most frequent questions and their frequencies. There are questions repeatedly being asked about dates, titles and page numbers. A sunburst of the first 4 words of the questions is shown in Figure 6. It can be seen that a large majority of questions start with “what is the”, asking for a date, title, total, amount or name.

The distribution of answer lengths is shown in Figure 4e. We observe in the figure that both DocVQA and SQuAD 1.1 have a higher number of longer answers compared to the VQA datasets. The average answer length is 2.17, and 63.2% of the answers are unique, which is second only to SQuAD 1.1 (72.5%). The top 15 answers in the dataset are shown in Figure 4b. We observe that almost all of the top answers are numeric values, which is expected since there are a good number of document images of reports and invoices. In Figure 4c we show the top 15 non-numeric answers.

These include named entities such as names of people, institutions and places. The word cloud on the left in Figure 5 shows frequent words in answers. The most common words are names of people and names of calendar months.

Figure 5: Word clouds of words in answers (left) and words spotted on the document images in the dataset (right).

Figure 6: Distribution of questions by their starting 4-grams. Most questions aim to retrieve common data points in documents such as date, title, total amount and page number.

In Figure 4f we show the number of images (or ‘contexts’ in the case of SQuAD 1.1) containing a particular number of text tokens. The average number of text tokens in an image or context is the highest in the case of DocVQA (182.75). It is considerably higher compared to SQuAD 1.1, where contexts are usually small paragraphs whose average length is 117.23. In the case of VQA datasets which comprise real world images, the average number of OCR tokens is not more than 13. The word cloud on the right in Figure 5 shows the most common words spotted by the OCR on the images in DocVQA. We observe that there is a high overlap between common OCR tokens and words in answers.

4. Baselines

In this section we explain the baselines we use, including heuristics and trained models.

4.1. Heuristics and Upper Bounds

The heuristics we evaluate are: (i) Random answer: measures performance when we pick a random answer from the answers in the train split. (ii) Random OCR token: performance when a random OCR token from the given document image is picked as the answer. (iii) Longest OCR token: the case when the longest OCR token in the given document is selected as the answer. (iv) Majority answer: measures the performance when the most frequent answer in the train split is considered as the answer.

We also compute the following upper bounds: (i) Vocab UB: measures the performance upper bound one can get by predicting correct answers for the questions, provided the correct answer is present in a vocabulary of answers comprising all answers which occur more than once in the train split. (ii) OCR substring UB: the upper bound on predicting the correct answer, provided the answer can be found as a substring in the sequence of OCR tokens. The sequence is made by serializing the OCR tokens recognized in the document as a space-separated sequence, in top-left to bottom-right order. (iii) OCR subsequence UB: the upper bound on predicting the correct answer, provided the answer is a subsequence of the OCR tokens’ sequence.
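To make the last two upper bounds concrete, here is a minimal sketch of the substring and subsequence checks over the serialized OCR tokens. It illustrates the definitions above rather than the authors' evaluation code; the subsequence check is interpreted at the token level (a character-level reading of the definition is also possible), and lowercased matching is a simplifying assumption.

```python
def ocr_substring_hit(answer, ocr_tokens):
    """OCR substring UB: the answer must appear as a substring of the
    space-separated OCR token sequence (top-left to bottom-right order)."""
    serialized = " ".join(ocr_tokens)
    return answer.lower() in serialized.lower()


def ocr_subsequence_hit(answer, ocr_tokens):
    """OCR subsequence UB (token-level reading): the answer's words must
    appear in order, but not necessarily contiguously, in the OCR tokens."""
    answer_tokens = answer.lower().split()
    remaining = iter(t.lower() for t in ocr_tokens)
    # Membership tests consume the iterator, enforcing left-to-right order.
    return all(tok in remaining for tok in answer_tokens)


# Example: OCR tokens read off a document, top-left to bottom-right.
ocr = ["Great", "Western", "Sugar", "Co.", "Denver,", "Colorado", "80202"]
print(ocr_substring_hit("great western sugar co.", ocr))  # True (contiguous)
print(ocr_subsequence_hit("western colorado", ocr))       # True (in order, with gaps)
print(ocr_substring_hit("western colorado", ocr))         # False
```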

4.2. VQA Models


For evaluating the performance of existing VQA models on DocVQA we employ two models which take the text present in the images into consideration while answering questions – Look, Read, Reason & Answer (LoRRA) [35] and Multimodal Multi-Copy Mesh (M4C) [16].

LoRRA: follows a bottom-up and top-down attention [3] scheme with additional bottom-up attention over OCR tokens from the images. In LoRRA, tokens in a question are first embedded using a pre-trained embedding (GloVe [31]) and then these tokens are iteratively encoded using an LSTM [15] encoder. The model uses two types of spatial features to represent the visual information from the images: (i) grid convolutional features from a ResNet-152 [14] which is pre-trained on ImageNet [8], and (ii) features extracted from bounding box proposals from a Faster R-CNN [33] object detection model, pre-trained on Visual Genome [26]. OCR tokens from the image are embedded using a pre-trained word embedding (FastText [7]). An attention mechanism is used to compute an attention-weighted average of the image features as well as the OCR tokens’ embeddings. These averaged features are combined and fed into an output module. The classification layer of the model predicts an answer either from a fixed vocabulary (made from answers in the train set) or copies an answer from a dynamic vocabulary, which essentially is the list of OCR tokens in the image. Here the copy mechanism can copy only one of the OCR tokens from the image. Consequently it cannot output an answer which is a combination of two or more OCR tokens.
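As a rough illustration of the answer space just described (and not of LoRRA's actual architecture), the sketch below shows how scores over a fixed vocabulary of training-set answers can be concatenated with copy scores over the image's OCR tokens, so that the argmax either selects a vocabulary answer or copies exactly one OCR token, which is where the single-token limitation noted above comes from. All names and scores are invented for the example.

```python
import torch

def pick_answer(vocab_scores, copy_scores, fixed_vocab, ocr_tokens):
    """Choose an answer from a combined answer space: a fixed vocabulary of
    training-set answers plus a dynamic vocabulary made of the OCR tokens
    found in this particular image.

    vocab_scores: (V,) scores over the fixed vocabulary
    copy_scores:  (N,) scores over the N OCR tokens of the image
    """
    combined = torch.cat([vocab_scores, copy_scores])  # (V + N,)
    idx = int(torch.argmax(combined))
    if idx < len(fixed_vocab):
        return fixed_vocab[idx]                 # answer from the fixed vocabulary
    return ocr_tokens[idx - len(fixed_vocab)]   # copy exactly one OCR token

# Toy example with made-up scores: the copy branch can return only a single
# OCR token, so a multi-word answer such as "Great Western Sugar Co." is out
# of reach for this kind of single-token copy scheme.
fixed_vocab = ["yes", "no", "1970"]
ocr_tokens = ["Great", "Western", "Sugar", "Co."]
vocab_scores = torch.tensor([0.1, 0.2, 0.3])
copy_scores = torch.tensor([0.9, 0.5, 0.4, 0.2])
print(pick_answer(vocab_scores, copy_scores, fixed_vocab, ocr_tokens))  # "Great"
```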
M4C: uses a multimodal transformer and an iterative answer prediction module. Here tokens in questions are embedded using a BERT model [9]. Images are represented using (i) appearance features of the objects detected using a Faster R-CNN pretrained on Visual Genome [26] and (ii) location information - bounding box coordinates of the de-
