Document Collection Visual Question Answering
1 Introduction
Documents are essential for humans, since they have been used to store knowledge
and information throughout history. For this reason there has been a strong research
effort on improving the machine understanding of documents. The research field
of Document Analysis and Recognition (DAR) aims at the automatic extraction
of information presented on paper, originally intended for human comprehension.
Some of the most widely known applications of DAR involve processing office
documents by recognizing text [14], tables and forms layout [8], mathematical
expressions [25] and visual information like figures and graphics [30]. However,
even though all these research fields have progressed immensely during the last
decades, they have been agnostic to the end purpose they can be used for.
Moreover, despite the fact that document collections are as ancient as documents
themselves, research in this scope has been limited to document retrieval by
lexical content in word spotting [22,27], blind to the semantics and ignoring the
task of extracting higher-level information from those collections.
On the other hand, over the past few years Visual Question Answering (VQA)
has become one of the most relevant tasks linking vision and language.
Even though the works of [6] and [31] started considering text in VQA by requiring
the methods to read the text in the images to answer the questions, they
constrained the problem to natural scenes. It was [24] who first introduced VQA on
documents. However, none of those previous works consider the image collection
perspective, neither from real scenes nor from documents.
Q: In which years did Anna M. Rivers run for the State senator office?
A: [2016, 2020]
E: [454, 10901]
Fig. 1. Top: Partial visualization of sample documents in DocCVQA. The left document
corresponds to the document with ID 454, which is one of the relevant documents
to answer the question below. Bottom: Example question from the sample set, its
answer and its evidences. In DocCVQA the evidences are the documents from which
the answer can be inferred. In this example, the correct answer is the years 2016
and 2020, and the evidences are the document images with IDs 454 and 10901, which
correspond to the forms where Anna M. Rivers presented as a candidate for the State
senator office.
2 Related Work
2.1 Document understanding
Document understanding has been largely investigated within the document
analysis community with the final goal of automatically extracting relevant
information from documents. Most works have focused on structured or semi-structured
documents such as forms, invoices, receipts, passports or ID cards,
e-mails, contracts, etc. Earlier works [9,29] were based on a predefined set of
rules that required the definition of specific templates for each new type of
document. Later on, learning-based methods [8,26] made it possible to automatically
classify the type of document and identify relevant fields of information without
predefined templates. Recent advances in deep learning [20,36,37] leverage natural
language processing, visual feature extraction and graph-based representations
in order to obtain a more global view of the document that takes into account word
semantics and visual layout in the process of information extraction.
All these methods mainly focus on extracting key-value pairs, following a
bottom-up approach, from the document features to the relevant semantic
information. The task proposed in this work takes a different top-down approach,
using the visual question answering paradigm, where the goal drives the search
for information in the document.
Nonetheless, the field became very popular and several new datasets were
released exploring new challenges, like ST-VQA [6] and TextVQA [31], which were
the first datasets that considered the text in the scene. In the former dataset, the
answers are always contained within the text found in the image, while the latter
requires reading the text, but the answer might not be a direct transcription
of the recognized text. The incorporation of text in VQA posed two main challenges.
First, the number of classes as possible answers grew exponentially and,
second, the methods had to deal with many out-of-vocabulary (OOV) words,
both as answers and as input recognized text. To address the problem of OOV
words, embeddings such as Fasttext [7] and PHOC [1] became more popular,
while in order to predict an answer, along with the standard fixed vocabulary
of the most common answers, a copy mechanism was introduced by [31] that
made it possible to propose an OCR token as an answer. Later, [13] changed the
classification output to a decoder that outputs a word from the fixed vocabulary
or from the recognized text at each timestep, providing more flexibility for
complex and longer answers.
Concerning documents, FigureQA [16] and DVQA [15] focused on complex
figures and data representations like different kinds of charts and plots, proposing
synthetic datasets and corresponding questions and answers over those figures.
More recently, [23] proposed DocVQA, the first VQA dataset over document
images, where the questions refer not only to figures, forms or tables, but also to
text in complex layouts. Along with the dataset they proposed some baselines
based on NLP and scene text VQA models. In this sense, we go a step further,
extending this work to document collections.
Finally, one of the most relevant works for this paper is ISVQA [4], where
the questions are asked over a small set of images which consist of different
perspectives of the same scene. Notice that even though the setup might seem
similar, the methods to tackle that dataset and the one we propose are very
different. In ISVQA all the images share the same context, which implies that
finding some information in one of the images can be useful for the other images
in the set. In addition, the image sets are always small sets of 6 images, in
contrast to the whole collection of DocCVQA, and finally, the images depict
real scenes and do not even consider the text. As an example, the baselines
they propose are based on HME-VideoQA [11] and on standard VQA methods
that stitch all the images, or the image features, which are not suitable for our
problem.
3 DocCVQA Dataset
In this section we describe the process for collecting images, questions and
answers, provide an analysis of the collected data and, finally, describe the metric
used for the evaluation of this task.
ID  Question
8   Which candidates in 2008 were from the Republican party?
9   Which candidates ran for the State Representative office between 06/01/2012 and 12/31/2012?
10  In which legislative counties did Gary L. Schoessler run for County Commissioner?
11  For which candidates was Danielle Westbrook the treasurer?
12  Which candidates ran for election in North Bonneville who were from neither the Republican nor Democrat parties?
13  Did Valerie I. Quill select the full reporting option when she ran for the 11/03/2015 elections?
14  Which candidates from the Libertarian, Independent, or Green parties ran for election in Seattle?
15  Did Suzanne G. Skaar ever run for City Council member?
16  In which election year did Stanley J Rumbaugh run for Superior Court Judge?
17  In which years did Dean A. Takko run for the State Representative office?
18  Which candidates running after 06/15/2017 were from the Libertarian party?
19  Which reporting option did Douglas J. Fair select when he ran for district court judge in Edmonds? Mini or full?
questions and the test set with the remaining 12. Given the low variability of
the documents' layout, we ensured that the test set contained questions which
refer to document form fields, or that have constraints, that were not seen
in the sample set. In addition, as depicted in Figure 2, the number of relevant
documents is quite variable among the questions, which poses another challenge
that methods will have to deal with.
Fig. 2. Number of relevant documents in ground truth for each question in the sample
set (blue) and the test set (red).
The ultimate goal of this task is the extraction of information from a collection of
documents. However, as previously demonstrated, and especially in unbalanced
datasets, models can learn that specific answers are more common for specific
questions. One of the clearest cases is the answer Yes to questions that are
answered with Yes or No. To prevent this, we evaluate not only the answer to
the question, but also whether the answer has been reasoned from the document that
contains the information to answer the question, which we consider as evidence.
Therefore, we have two different evaluations: one for the evidence, which is based
on retrieval performance, and one for the answer, based on text VQA
performance.
Table 2. Description of the document form fields with a brief analysis of their
variability, showing the number of values and unique values in their annotations.
Evidences: Following standard retrieval tasks [19] we use the Mean Average
Precision (MAP) to assess the correctness of the positive evidences provided
by the methods. We consider as positive evidences the documents in which the
answer to the question can be found.
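As an illustration, the snippet below is a minimal Python sketch of the Average Precision for a single question, assuming the evaluated method returns documents ranked by confidence; MAP is then the mean of this value over all questions. The function name is ours and this is not the official evaluation code.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    # ranked_doc_ids: document ids sorted by the method's confidence (best first).
    # relevant_ids: ground-truth evidence documents for this question.
    relevant_ids = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0
```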
Answers: Following other text-based VQA tasks [5,6] we use the Average Normalized
Levenshtein Similarity (ANLS), which captures the model's reasoning
capability while smoothly penalizing OCR recognition errors. However, in our
case the answers are a set of items for which the order is not relevant, in contrast
to common VQA tasks where the answer is a single string. Thus, we need to
adapt this metric to make it suitable for our problem. We name this adaptation
Average Normalized Levenshtein Similarity for Lists (ANLSL), formally
described in Equation 1. Given a question Q, the ground truth list of answers
G = {g_1, g_2, ..., g_M} and a model's list of predicted answers P = {p_1, p_2, ..., p_N},
ANLSL performs the Hungarian matching algorithm to obtain K pairs
U = {u_1, u_2, ..., u_K}, where K is the minimum of the ground truth and
predicted answer list lengths. The Hungarian matching (Ψ) is performed
according to the Normalized Levenshtein Similarity (NLS) between each ground
truth element g_j ∈ G and each prediction p_i ∈ P. Once the matching is performed,
all the NLS scores of the pairs u_z ∈ U are summed and divided by the
maximum of the ground truth and predicted answer list lengths. Therefore, if
there are more or fewer ground truth answers than predicted ones, the method
is penalized.
U = \Psi(NLS(G, P))

ANLSL = \frac{1}{\max(M, N)} \sum_{z=1}^{K} NLS(u_z)    (1)
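The following is a minimal sketch of how ANLSL could be computed, assuming the Hungarian matching Ψ is carried out with SciPy's linear_sum_assignment over the pairwise NLS scores; names and structure are illustrative, not the official evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def nls(a: str, b: str) -> float:
    # Normalized Levenshtein Similarity between two strings.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def anlsl(gt_answers, pred_answers):
    # Hungarian matching on the NLS scores, summed and divided by max(M, N).
    if not gt_answers or not pred_answers:
        return 0.0
    scores = np.array([[nls(g, p) for p in pred_answers] for g in gt_answers])
    rows, cols = linear_sum_assignment(-scores)  # maximize total similarity
    return scores[rows, cols].sum() / max(len(gt_answers), len(pred_answers))
```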
4 Baselines
This section describes the two baselines that are employed in the experiments.
Both baselines break down the task into two different stages. First, they rank the
documents according to their confidence of containing the information to answer
a given question and then, they extract the answers from the documents with the
highest confidence. The first baseline combines methods from the word spotting
and NLP question answering fields to retrieve the relevant documents and answer
the questions. We name this baseline Text spotting + QA. In contrast,
the second baseline is an ad-hoc method specially designed for this task and
data, which consists of extracting the information from the documents and
mapping it into the format of key-value relations. In this sense it represents the
collection similarly to how databases do, for which reason we name this baseline
Database. These baselines allow us to appreciate the performance of two very
different approaches.
The objective of this baseline is to set a starting performance result from the
combination of two simple but generic methods that will allow assessing the
improvement of future proposed methods.
c_d = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \min_{j=1}^{|OCR|} \left\{ NLD(qw_i, rw_j) \right\}    (2)
Notice that removing only stopwords is not enough: in the question
depicted in Figure 1, the verb run is not considered a stopword, but it cannot
be found in the document and consequently would be counterproductive.
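The sketch below illustrates the per-document score of Equation 2, assuming that stopwords (and, as discussed above, other non-informative question words) have already been filtered out; lower scores mean a better match, so documents would be ranked by ascending score. Helper names are ours.

```python
def normalized_levenshtein_distance(a: str, b: str) -> float:
    # NLD = edit distance divided by the length of the longer string.
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1] / max(len(a), len(b))

def spotting_score(question_words, ocr_words):
    # Eq. 2: average over question words of the minimum NLD to any OCR word.
    if not question_words or not ocr_words:
        return 1.0  # worst score when there is nothing to match against
    return sum(min(normalized_levenshtein_distance(qw, rw) for rw in ocr_words)
               for qw in question_words) / len(question_words)
```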
Answering: Once the documents are ranked, to answer the given questions
we make use of the BERT [10] question answering model. BERT is a task-agnostic
language representation based on transformers [33] that can afterwards be used
in other downstream tasks. In our case, we use extractive question answering
BERT models, which consist of predicting the answer as a text span from a context,
usually a passage or paragraph, by predicting the start and end indices on
that context. Nonetheless, there is no such context in the DocCVQA documents
that encompasses all the textual information. Therefore, we follow the approach
of [23] to build this context by serializing the recognized OCR tokens of the
document images into a single string separated by spaces, following a top-left to
bottom-right order. Then, following the original implementation of [10], we introduce
a start vector S ∈ R^H and an end vector E ∈ R^H. The probability of a
word i being the start of the answer span is obtained as the dot product between
the BERT word embedding hidden vector T_i and S, followed by a softmax over
all the words in the paragraph. The same formula is applied to compute whether
word i is the end token by replacing the start vector S with the end vector E.
Finally, the score of a candidate span from position i to position j is defined as
S · T_i + E · T_j, and the maximum scoring span where j ≥ i is used as the prediction.
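As a rough illustration of this answering step, the sketch below serializes OCR tokens into a context string and selects the highest-scoring span S · T_i + E · T_j with j ≥ i from precomputed BERT token embeddings. The line-free sorting by (y, x), the max_len limit and the function names are our simplifying assumptions, not the exact implementation of [10] or [23].

```python
import numpy as np

def serialize_ocr(tokens):
    # tokens: list of (text, x, y) tuples. A crude top-left to bottom-right
    # ordering; a real serialization would first group tokens into text lines.
    ordered = sorted(tokens, key=lambda t: (t[2], t[1]))
    return " ".join(t[0] for t in ordered)

def best_answer_span(T, S, E, max_len=30):
    # T: (seq_len, H) BERT token embeddings; S, E: start and end vectors.
    start_scores = T @ S
    end_scores = T @ E
    best_score, best_span = -np.inf, (0, 0)
    for i in range(len(T)):
        for j in range(i, min(i + max_len, len(T))):
            score = start_scores[i] + end_scores[j]  # S · T_i + E · T_j
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span  # token indices of the predicted answer span
```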
5 Results
5.1 Evidences
To initially assess the retrieval performance of the methods, we first compare
two different commercial OCR systems that we are going to use for text spotting:
Google OCR [12] and Amazon Textract [2]. As reported in Table 3, the
text spotting performance with the latter OCR is better than with Google OCR,
and it is the only one capable of extracting the key-value relations for the database
approach. For this reason we use it as the standard OCR for the rest of the
text spotting baselines.
Compared to text spotting, the database retrieval average performance is
similar. However, as depicted in Figure 3, we can appreciate that it performs better
for all the questions except number 11, where it gets a MAP of 0. This results
from the fact that the key-value pair extractor is not able to capture the
relation between some of the form fields, in this case the treasurer name, and
consequently it catastrophically fails at retrieving documents with specific values
in those fields, one of the main drawbacks of such rigid methods. On the other
hand, the questions where the database approach shows a greater performance
gap are those where, in order to find the relevant documents, the methods must
not only search for documents with a particular value, but also understand more
complex constraints such as the ones described in Section 3.1, namely finding documents
between two dates (question 9), after a date (question 18), documents that do not
contain a particular value (question 12), or where several values are considered
as correct (question 14).
Fig. 3. Evidence retrieval performance of the different methods, reported for each
question in the test set.
5.2 Answers
the model only sees around 80 samples. Nonetheless, this is sufficient to improve
the answering performance without harming the previous knowledge.
Given the collection nature of DocCVQA, the answer to a question usually
consists of a list of texts found in different documents considered as relevant.
In our case, we consider a document as relevant when the confidence provided
by the retrieval method for that document is greater than a threshold. For the
text spotting methods we have fixed the threshold through an empirical study,
in which we found that the best threshold is 0.9. In the case of the database
approach, given that the confidence provided is either 0 or 1, we consider relevant
all positive documents.
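A minimal sketch of this aggregation step is shown below: documents whose retrieval confidence exceeds the threshold are treated as relevant, and an answer is extracted from each of them to form the final answer list. The answer_fn placeholder stands for either answering method and, like the other names here, is our own illustration.

```python
def answer_question(question, doc_confidences, documents, answer_fn, threshold=0.9):
    # doc_confidences: document id -> retrieval confidence for this question.
    # answer_fn: callable (question, document) -> answer string (placeholder).
    relevant_ids = [doc_id for doc_id, conf in doc_confidences.items()
                    if conf > threshold]
    answers = [answer_fn(question, documents[doc_id]) for doc_id in relevant_ids]
    return relevant_ids, answers  # predicted evidences and list of answers
```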
In the experiments we use the BERT answering baseline to answer the questions
over the ranked documents from both the text spotting and the database retrieval
methods, but we use the database answering method only over the ranked
documents from the same retrieval approach. As reported in Table 4, the latter
is the one that performs best. The main reason for this is that a wrong
retrieval of documents prevents the answering methods from finding the necessary
information to provide the correct answers. Nevertheless, having the
key-value relations allows the database method to directly output the value of
the requested field as the answer, while BERT needs to learn to extract it from
a context that has partially lost the spatial information of the recognized text:
at the time the context is created, the value of a field might not end up close to
the field name, losing the semantic connection between the key-value pair. To
showcase the upper bounds of the answering methods, we also provide their
performance regardless of the retrieval system, where the documents are ranked
according to the test ground truth.
Retrieval method  Answering method    MAP     ANLSL
Text spotting     BERT               72.84    0.4513
Database          BERT               71.06    0.5411
Database          Database           71.06    0.7068
GT                BERT              100.00    0.5818
GT                Database          100.00    0.8473
As depicted in Figure 4, BERT does not perform well when the answers are
candidates' names (questions 8, 9, 11, 14 and 18). However, it performs better
when asked about dates (questions 16 and 17) or legislative counties
(question 10). On the other hand, the database approach is able to provide the
required answer, usually depending solely on whether the text and the key-value
relationships have been correctly recognized.
The most interesting question is number 13, where none of the methods
is able to answer the question even with a correct retrieval. This question
asks whether a candidate selected a specific checkbox value. The difference here is
that the answer is No, in contrast to the sample question number 3. BERT
cannot answer because it lacks a document collection point of view and,
moreover, since it is an extractive QA method, it would require a No
in the document surrounded by some context that could help to identify that
word as the answer. On the other hand, the database method fails because of its
logical structure: if there is a relevant document for that question, it will find
the field the query is asking for, or it will answer 'Yes' if the question is
a Yes/No type.
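A toy sketch of this logical structure is given below; the record format, field names and function signature are our own assumptions for illustration, not the actual schema of the database baseline.

```python
def database_answer(records, constraints, query_field=None):
    # records: document id -> dict of extracted key-value pairs.
    # constraints: field -> required value, used to select relevant documents.
    relevant = [rec for rec in records.values()
                if all(rec.get(k) == v for k, v in constraints.items())]
    if query_field is None:                  # Yes/No style question
        return "Yes" if relevant else "No"   # never answers No for a retrieved document
    return [rec[query_field] for rec in relevant if query_field in rec]
```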
6 Conclusions
This work introduces a new and challenging task for both the VQA and DAR
research fields. We presented DocCVQA, which aims to provide a new perspective
on document understanding and to highlight the importance and difficulty of
contemplating a whole collection of documents. We have shown the performance
of two different approaches. On the one hand, a text spotting with extractive
QA baseline that, although it achieves lower performance, is more generic
and could achieve similar performance on other types of documents. On the
other hand, a baseline that represents the documents by their key-value relations
which, despite achieving quite good performance, is still far from perfect and,
because of its design, is very limited and cannot generalize at all when processing
other types of documents. In this regard, we believe that the next step is to
propose a method that can reason about the whole collection in a single stage,
being able to provide both the answer and the positive evidences.
Acknowledgements
This work has been supported by the UAB PIF scholarship B18P0070 and the
Consolidated Research Group 2017-SGR-1783 from the Research and University
Department of the Catalan Government.
References
1. Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word spotting and recognition
with embedded attributes. IEEE transactions on pattern analysis and machine
intelligence 36(12), 2552–2566 (2014)
2. Amazon: Amazon textract (2021), https://aws.amazon.com/es/textract/, accessed
on Jan 11, 2021
3. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.:
Vqa: Visual question answering. In: Proceedings of the IEEE international confer-
ence on computer vision. pp. 2425–2433 (2015)
4. Bansal, A., Zhang, Y., Chellappa, R.: Visual question answering on image sets. In:
European Conference on Computer Vision. pp. 51–67. Springer (2020)
5. Biten, A.F., Tito, R., Mafla, A., Gomez, L., Rusinol, M., Mathew, M., Jawahar, C.,
Valveny, E., Karatzas, D.: Icdar 2019 competition on scene text visual question an-
swering. In: 2019 International Conference on Document Analysis and Recognition
(ICDAR). pp. 1563–1570. IEEE (2019)
6. Biten, A.F., Tito, R., Mafla, A., Gomez, L., Rusinol, M., Valveny, E., Jawahar,
C., Karatzas, D.: Scene text visual question answering. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision. pp. 4291–4301 (2019)
7. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
subword information. Transactions of the Association for Computational Linguis-
tics 5, 135–146 (2017)
8. Coüasnon, B., Lemaitre, A.: Recognition of tables and forms (2014)
9. Dengel, A.R., Klein, B.: smartFIX: A requirements-driven system for document
analysis and understanding. In: Lopresti, D., Hu, J., Kashi, R. (eds.) Document
Analysis Systems V. pp. 433–444. Springer Berlin Heidelberg (2002)
10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding. ACL (2019)
11. Fan, C., Zhang, X., Zhang, S., Wang, W., Zhang, C., Huang, H.: Heterogeneous
memory enhanced multimodal attention model for video question answering. In:
Proceedings of the IEEE/CVF Conference on CVPR. pp. 1999–2007 (2019)
12. Google: Google ocr (2020), https://cloud.google.com/solutions/document-ai, ac-
cessed on Dec 10, 2020
13. Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with
pointer-augmented multimodal transformers for textvqa. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
14. Hull, J.J.: A database for handwritten text recognition research. IEEE Transactions
on pattern analysis and machine intelligence 16(5), 550–554 (1994)
15. Kafle, K., Price, B., Cohen, S., Kanan, C.: Dvqa: Understanding data visualizations
via question answering. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. pp. 5648–5656 (2018)
16. Kahou, S.E., Michalski, V., Atkinson, A., Kádár, Á., Trischler, A., Bengio,
Y.: Figureqa: An annotated figure dataset for visual reasoning. arXiv preprint
arXiv:1710.07300 (2017)
17. Krishnan, P., Dutta, K., Jawahar, C.V.: Word spotting and recognition using deep
embedding. In: 2018 13th IAPR International Workshop on Document Analysis
Systems (DAS). pp. 1–6 (2018)
18. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and
reversals. In: Soviet physics doklady. pp. 707–710. Soviet Union (1966)
19. Liu, T.Y.: Learning to rank for information retrieval. Foundations and Trends in
Information Retrieval 3(3), 225–331 (2009)
20. Liu, X., Gao, F., Zhang, Q., Zhao, H.: Graph convolution for multimodal informa-
tion extraction from visually rich documents. In: Proceedings of the 2019 Conference
of the North American Chapter of the Association for Computational Linguistics. pp. 32–39
21. Malinowski, M., Fritz, M.: A multi-world approach to question answering about
real-world scenes based on uncertain input. arXiv preprint arXiv:1410.0210 (2014)
22. Manmatha, R., Croft, W.: Word spotting: Indexing handwritten archives. Intelli-
gent Multimedia Information Retrieval Collection pp. 43–64 (1997)
23. Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document
images. In: Proceedings of the IEEE/CVF WACV. pp. 2200–2209 (2021)
24. Mathew, M., Tito, R., Karatzas, D., Manmatha, R., Jawahar, C.: Document visual
question answering challenge 2020. arXiv e-prints pp. arXiv–2008 (2020)
25. Mouchère, H., Viard-Gaudin, C., Zanibbi, R., Garain, U.: Icfhr2016 crohme: Com-
petition on recognition of online handwritten mathematical expressions. In: 2016
15th International Conference on Frontiers in Handwriting Recognition (ICFHR)
26. Palm, R.B., Winther, O., Laws, F.: Cloudscan - a configuration-free invoice analysis
system using recurrent neural networks. In: 2017 14th IAPR International
Conference on Document Analysis and Recognition (ICDAR). pp. 406–413
27. Rath, T.M., Manmatha, R.: Word image matching using dynamic time warping.
In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2003. Proceedings. vol. 2, pp. II–II. IEEE (2003)
28. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question
answering. arXiv preprint arXiv:1505.02074 (2015)
29. Schuster, D., et al.: Intellix – end-user trained information extraction for document
archiving. In: 2013 12th ICDAR
30. Siegel, N., Horvitz, Z., Levin, R., Divvala, S., Farhadi, A.: Figureseer: Parsing
result-figures in research papers. In: ECCV. pp. 664–680. Springer (2016)
31. Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh,
D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition
32. Sudholt, S., Fink, G.A.: Evaluating word string embeddings and loss functions
for cnn-based word spotting. In: 2017 14th IAPR International Conference on
Document Analysis and Recognition (ICDAR). vol. 01, pp. 493–498 (2017)
33. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
34. Wilkinson, T., Lindström, J., Brun, A.: Neural ctrl-f: Segmentation-free query-by-
string word spotting in handwritten manuscript collections. In: 2017 IEEE Inter-
national Conference on Computer Vision (ICCV). pp. 4443–4452 (2017)
35. Wolf, T., et al.: Transformers: State-of-the-art natural language processing. In:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations. Association for Computational Linguistics
36. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: Layoutlm: Pre-training of
text and layout for document image understanding. p. 1192–1200. KDD ’20 (2020)
37. Zhang, P., et al.: TRIE: End-to-End Text Reading and Information Extraction for
Document Understanding, p. 1413–1422 (2020)