
ScreenQA: Large-Scale Question-Answer Pairs

Over Mobile App Screenshots

Yu-Chung Hsiao∗ † Fedir Zubach∗ Gilles Baechler Victor Cărbune Jason Lin

Maria Wang Srinivas Sunkara Yun Zhu Jindong Chen



Google DeepMind
[email protected]

Abstract

We present ScreenQA, a new benchmark and dataset for screen content understanding via question answering. Existing screen datasets focus either on structure and component-level understanding, or on much higher-level composite tasks such as navigation and task completion. We attempt to bridge the gap between the two by annotating 86K question-answer pairs over the RICO dataset, in the hope of benchmarking screen reading comprehension capacity. This work is also the first to annotate answers for different application scenarios, including both full sentences and short forms, as well as the supporting UI contents on screen and their bounding boxes. With this rich annotation, we discuss and define the evaluation metrics of the benchmark, show applications of the dataset, and provide a few baselines using closed and open source models.

1 Introduction
The recent advent of machine learning, especially Visual Large Language Models (VLMs), has motivated a long list of applications based on mobile screens. To name a few: a personal agent that lets users operate a mobile device hands-free or eyes-free, code generation from UI design mock-ups, adaptation of device UIs, automatic ads generation, and testing and critiquing mobile apps all require advanced technology for understanding or annotating mobile screens.
Mobile app screenshots have been analyzed using machine learning from multiple aspects. These analyses range from pixel-level understanding, e.g., layout structural analyses, UI issue detection and correction (Li et al., 2022), to UI element semantics, e.g., icon recognition and button action prediction (Sunkara et al., 2022), to even higher-level functional analyses such as accessibility support (Li et al., 2020b), screen description (Wang et al., 2021), and screen type classification (Deka et al., 2017). Comparatively, the content understanding aspect is understudied. Examples of content include star ratings from restaurant reviews, flight status, and messages from chats. Having this understanding capacity is important for two reasons: first, most user activities are information seeking; second, transactional tasks (e.g., booking a flight) require a precise understanding of the UI content, including actionable elements and their states. In this work, we advocate using the screenshot as the sole representation of a UI screen, as the underlying structural representation can be messy. With this setting, it is important to devise a task that quantitatively measures information detection, extraction and understanding from screenshot images.

∗Co-first authors with equal contributions.

†Work done while at Google.

Preprint. Under review.


To this end, we annotated the RICO dataset (Deka et al., 2017) with 85,984 question-answer pairs, referred to as ScreenQA, and released the dataset in the public domain.1 In each pair, the answer responds to the question with respect to a screenshot. Each answer is a 4-tuple: a short answer (e.g., a phrase), a full-sentence answer, a list of UI contents supporting the answer, and a list of bounding boxes of those contents. See examples in Figure 1. The components of the answers, individually or in combination, serve different purposes in metrics for tasks motivated by real-life applications: for example, short answers for information extraction, long answers for summarization, UI contents for textual grounding in entity extraction, and bounding boxes for location grounding. To the best of our knowledge, this is the first large-scale question answering dataset over mobile app screenshots, and the first one to be publicly available. Inspired by the SQuAD dataset (Rajpurkar et al., 2016a), we hope to encourage the community to advance technologies toward better screen content understanding, which will benefit the machine learning and human-computer interaction (HCI) communities alike.
The contributions of this work are the following.
1. We create and release ScreenQA, a large-scale question answering dataset and benchmark
for mobile screens, as the first of its kind.
2. ScreenQA is also the only dataset with rich annotations, including short answers, full
sentence answers, supporting UI content, and their bounding boxes.
3. Based on the rich annotations, we define 4 tasks and corresponding metrics motivated by
the need to measure quality in real-life applications.
4. We build two baselines with closed and open source models (ScreenAI 5B (Baechler et al., 2024) and PaliGemma (development contributors et al., 2024)), and release open source model checkpoints.

2 Related work

UI graphics are typically designed to be informative and actionable for users. Existing datasets can
be categorized by their associated tasks, which we outline below.

Screen and UI-based Understanding In contrast to natural images, screen images are composed of structured components generated by a tree-like syntax of view hierarchies (VH) or Document Object Models (DOM). Previous works identify elements, e.g., icon detection (Deka et al., 2017), widget captioning via language (Li et al., 2020b), and referring expressions in various classification (Wu et al., 2023), retrieval (Bai et al., 2021) and generation (Hong et al., 2023) setups. Meanwhile, others specialize in challenging variations, e.g., aspect ratios, sizes, and OCR. In developing web agents, MoTIF (Burns et al., 2022) and VisualWebArena (Koh et al., 2024) provide interactive app environments for evaluating visually grounded screen agents. Rather than high-level goal completion, our focus is on content understanding of language-heavy UIs in the broader context of human interaction.

Multimodal Question Answering Question answering tasks can be categorized by 1) open- vs. closed-domain and 2) the capacities they evaluate, ranging from reading comprehension to multi-hop and logical reasoning. ScreenQA is a closed-domain question answering task that expects answers by span (or UI element phrase) selection for screen reading comprehension. As described in Section 3, we instructed the data annotators to avoid multi-hop, mathematical counting, and logical reasoning, in order to focus on the fundamental screen comprehension capacity. With the flexibility of both short and long-form answers, along with grounded bounding boxes, in a question-answering format, ScreenQA natively aligns with most LLMs' pre-training recipes.

Document Image Understanding Given Vision-Language Models' (VLMs) limitations in counting and logical reasoning, recent datasets like InfographicVQA (Mathew et al., 2022) and ChartQA (Masry et al., 2022) were created. Mobile app screenshots contain nearly all possible representations of information, predominantly text, blended with icons, symbols, and images rendered as pixels. This makes the problem similar to the understanding of scanned or photographed documents. DocVQA (Mathew et al., 2021) uses an extractive QA format for span/segment extraction. Along with TextVQA and several other domain-specific datasets (Mishra et al., 2019; Kahou et al., 2017) with applications, e.g., for textbooks
1 The ScreenQA dataset is released at https://github.com/google-research-datasets/screen_qa.

Figure 1: ScreenQA examples. (a) Question: "How many likes and comments for the post 'Why Michael Flynn...'?" Short Answer: "1, 1", Full Answer: "There is 1 like and 1 comment", UI Content: "1", "1"; bounding boxes are shown in green. (b) Question with ambiguity (shown in red): "What's the temperature on Saturday?" (c) Question with no answer: "What is the date of version 1.3.1?"

(Singh et al., 2019), math problems, and receipts, we believe that techniques developed for relating the 2D arrangement of text are applicable to screens.

3 Data annotation

We perform several steps to collect the ScreenQA annotations, as depicted in Figure 2. Each step is
described below. See Appendix C for collected question-answer data examples.

3.1 Pre-filtering

The pre-filtering stage filters out 1) screenshots from non-English apps (as opposed to "non-English screenshots", since translation and dictionary apps could cause confusion) and 2) screenshots whose view hierarchies (VHs) are out of sync with the main contents. It is a known issue in the RICO dataset that some screenshots and their corresponding view hierarchies are not perfectly synchronized: there exists a certain time difference between view hierarchy extraction and screenshot capturing (Zang et al., 2021). We remove those screenshots to ensure that ScreenQA annotations are not subject to such data noise.
Classifying the sync quality is tricky, even for human readers. One may not be able to differentiate between occlusion, ghosting, and actual out-of-sync cases. See Figure 3 for examples. Accordingly, we
instructed the annotators to focus on the main content area of the screen and make sure the bounding
boxes in that area are not corrupted, as this is where most contents of interest and questions come
from.
We use 27 annotators to perform this step. Among RICO's 66K unique screenshots, about 11K screenshots are from non-English apps, and about 13K screenshots have out-of-sync view hierarchies. This out-of-sync number differs from that reported by Li et al. (2020a) because we focus on the main content area. After filtering, we are left with about 51K screenshots from English apps with in-sync VHs.

Figure 2: ScreenQA annotation process.

Figure 3: View hierarchies (VHs) are overlaid on the screenshots with class names and the first few characters printed to assist annotators in determining whether the VHs for the main contents are in sync. (a) VH with occluded elements. (b) Ghosting VH from menu. (c) Out-of-sync VH for main content.

3.2 Question annotation

For question annotation, we asked the annotators to frame questions given a screenshot as the context. The annotators were expected to compose natural, daily-life questions as if they were using the app. The composed questions should inquire about information that can be directly read off the screen and should not require logical reasoning, counting, calculation, or mathematical comparison. We further required the annotators not to ask questions about any advertisement on the screen.
The annotation UI tool is depicted in Appendix A.1. We asked the annotators to compose up to five questions given a screenshot in the first pass. In the second pass, we asked for up to three questions given a screenshot and the questions previously composed. Each pass involves one annotator per screenshot, and whoever annotated the screenshot before is excluded from being assigned to the same screenshot. This ensures that every screenshot is assigned precisely two annotators to compose questions. We chose this sequential process to avoid tricky deduplication of similar questions and to encourage annotators to diversify their questions. Note that the same set of annotators is involved in both passes, such that each annotator has an opportunity to develop their own question style in the first pass before seeing others' questions in the second pass. This ensures that the dataset retains a certain number of question styles before they converge toward each other in repeated passes.
We again involved the 27 annotators. The first pass of question annotation generated 46K questions, and the second pass added another 36K questions, resulting in 82K questions in total. Around 15K screenshots are left with no questions, due to a lack of interesting content.

3.3 Answer annotation

We use the 82K questions, covering 35K distinct screenshots, from the previous two-pass question annotation step to further annotate the corresponding answers. The annotator who composed a question is excluded from annotating its answer, to avoid potential biases. Our answer annotation UI tool is shown in Appendix A.2.
Given an example, which contains a screenshot and a question, the annotators are tasked to
1. Fix any grammatical errors or typos of the given question without altering its intention.
2. Answer the question, based on the context of the given screenshot, by 1) selecting bounding
boxes from the underlying view hierarchy leaf nodes that contain the relevant answers, or
drawing bounding boxes if no suitable leaf nodes can be used, and 2) ranking the answers in
descending order of relevance if applicable, or by the common reading order.
3. Additionally, provide a full-sentence answer to the question.
4. Consider two exceptions: 1) the question may be incomprehensible, or 2) the screenshot may not contain the answer to the question due to the questioner's lack of understanding of the app. In these cases, the example should be marked as "invalid question" or "not answerable from the screenshot", respectively.
5. One answer is annotated for the train split, and three for the validation and test splits. This is to improve the evaluation quality. More details on the data splits are provided in Section 4.1.

             Screenshots   Questions
Train             28,378      68,951
Validation         3,485       8,614
Test               3,489       8,419
Total             35,352      85,984

Figure 4: Distribution of question answerability.   Figure 5: Dataset split stats.
The “invalid question” annotations are then filtered out, and the questions that have no other answer
annotations are excluded from the overall ScreenQA dataset, as they are considered incorrectly
annotated during the question annotation phase.

3.4 Not-answerable question annotations

The questions marked as "not answerable from the screenshot" represent a special category of questions that checks model overtriggering (attempting to answer questions that are not supposed to be answered). Being able to conclude that the answer is not present on the screen is an important aspect of screen understanding. Note that it is possible that one annotator considered a question not answerable while another provided an answer to that same question.
As described in Section 3.2, the first two passes of question annotation aimed to compose questions that can be answered from the screen, so, as expected, the fraction of not-answerable questions was small. We then ran a third pass of question annotation to raise this fraction to nearly 10%, see Figure 4. For this, we used nearly 5K screenshots selected randomly from those that had no such questions yet. In this pass, we asked annotators for exactly one additional question per screenshot that was related to the information on the screen but could not be answered from it. See the example in Figure 1c. Answer annotation was not performed for these 5K questions.

3.5 Short answers generation

One may argue that the exact UI elements containing the answer to a user's question are not directly useful to the user, as it is not always straightforward to convert them into the answer itself, which is the only important thing in many such scenarios. With that in mind, an alternative form of answer was produced for each question: a short answer.
There are many ways to represent the same information. For example, "25.01.2023", "25th of January 2023" and "January 25, 2023" represent the same date, and a model should not be penalized for choosing one over the others. To allow this flexibility, multiple short answers were produced per question, covering various representations of the same factual answer. The ground truth for each question in the ScreenQA dataset is therefore a list of possible short answers.
A version of the PaLM 2 model (Anil et al., 2023) was used to generate this list of short answers in a few-shot setting. Textual information from the ScreenQA dataset (question, list of UI element descriptions, and full-sentence answer) was used as input. See Appendix B for details about the prompts used. The generated lists were then verified by simple heuristics and by manual inspection of randomly selected samples.
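As an illustration only (not the production pipeline), the sketch below shows how such a few-shot generation step could be assembled using the prompt format of Appendix B; call_llm is a hypothetical stand-in for whichever LLM endpoint is used, and the output parsing assumes the model returns a Python-style bracketed list.

import ast

PROMPT_HEADER = ("List various ways to rephrase the answer. The answer should be as short "
                 "as possible, without extra words from the question. Use all provided "
                 "elements in each answer. Provide the output in square brackets.\n")

def build_prompt(examples, question, ui_elements, full_answer):
    # examples: list of (question, answer_elements, full_answer, rephrases) tuples.
    parts = [PROMPT_HEADER]
    for i, (q, els, fa, reph) in enumerate(examples):
        lead = "Here is an example:" if i == 0 else "Here is another example:"
        parts.append(f"{lead}\nQuestion: '{q}'\nAnswer elements: {els!r}\n"
                     f"Full answer: '{fa}'\nRephrases: {reph!r}\n")
    parts.append(f"Now is your turn.\nQuestion: '{question}'\n"
                 f"Answer elements: {ui_elements!r}\nFull answer: '{full_answer}'\nRephrases:")
    return "\n".join(parts)

def generate_short_answers(call_llm, prompt):
    raw = call_llm(prompt)                      # hypothetical model call, e.g. returns "['65%']"
    try:
        return list(ast.literal_eval(raw.strip()))
    except (ValueError, SyntaxError):
        return []                               # fall back to an empty list if parsing fails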

Table 1: Top (≥ 1.0%) question category distribution and examples. Please see Appendix D for more
categories.
Category % Examples
UI selection & config 18.1 Which option is selected? What is the selected ringtone?
Quantity number 11.7 How many unread messages? How many pictures are there in Western Europe?
App name 10.4 What is the name of the application? What is the app name?
Date time 9.4 When was “Heal the Living” released? When is happy hour?
Price 3.4 How much is the gift bonus in 3rd place? What is the price?
Name of item 3.3 What is the name of the drug? What is the name of chef?
User name 2.8 What is the name of the user? What is the username on telegram?
Duration 2.5 What is the duration of video? How long is the song?
Enum. of avail. options 2.5 Which social media options are given there? What are the options available for logging in?
Address and direction 2.4 What is the current location? What is the service zip code?
Email address 2.4 What is an email address? What is customer service email?
Person’s name 2.1 Who sang the song? What is the last name?
Signup/login 1.6 Which application can be used to sign up / login? What are the alternative choices for signing up?
Version information 1.6 What is the version number? What is the new feature in version v3.1.3?
Weather 1.5 What is the range of temperature shown on Sunday? What is the weather forecast for Sunday?
Score & value 1.4 What is height/weight of the person? What is the score?
Yes/No 1.1 Is there any travel plans? Is there any favorite?
Phone number 1.0 What is the phone number? What is the prefix for the international mobile number?
Others 20.8 What’s the average speed? What is the user’s middle initial
What is the spending limit? Which team has 41 points?

Figure 6: Histograms of the number of composed questions and the number of bounding boxes in answers. (a) Number of composed questions per screenshot: the three question annotation passes were capped at five, three and one questions, respectively, resulting in a maximum of nine questions in total. (b) Number of bounding boxes used to answer the question: cases with no answer or a single bounding box account for 91-92% of the answers and have been removed from the chart for clarity on the long tail; answers with 10 or more bounding boxes account for less than 0.15%.

4 Dataset analysis

4.1 Dataset statistics

The ScreenQA dataset contains 85,984 questions from 35,352 distinct screenshots. It is split into train, validation and test sets in an approximately 80-10-10 ratio, see Table 5. Note that questions related to the same screenshot belong to the same split.
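The released dataset already provides this split; the sketch below merely illustrates, under assumed field names, how an approximately 80-10-10 split can be drawn at the screenshot level so that all questions of a screenshot land in the same subset.

import random

def split_by_screenshot(questions, seed=0):
    # questions: list of dicts, each with a 'screenshot_id' key (assumed field name).
    ids = sorted({q["screenshot_id"] for q in questions})
    random.Random(seed).shuffle(ids)
    n = len(ids)
    train_ids = set(ids[:int(0.8 * n)])
    val_ids = set(ids[int(0.8 * n):int(0.9 * n)])
    splits = {"train": [], "validation": [], "test": []}
    for q in questions:
        sid = q["screenshot_id"]
        key = "train" if sid in train_ids else ("validation" if sid in val_ids else "test")
        splits[key].append(q)
    return splits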

4.2 Question analysis

Among the 86K collected questions, there are 47.5K unique questions. It is natural and valid to ask
the same common questions over various screenshots, for example, “Which option is selected on the
screen?” and “What is the email address?”. Some screenshots receive more questions because they
usually contain more information to be asked about. Yet, the histogram still exhibits a reasonable
exponential decay with a mild slope, as depicted in Figure 6a.
To further understand what questions have been asked, we categorize the questions using regular
expressions based on a list of empirically determined question categories. The categories are meant

to provide a rough overview of the question annotations and by no means to provide a precise
categorization. The distribution and examples for these categories are shown in Table 1. Note that the questions were not composed at the annotators' full discretion: they are conditioned on the given screenshots. That is to say, the distribution is implicitly influenced by the RICO crawling process. For example, as RICO crawled screen traces from freshly installed apps and did not log into an account, a noticeable number of the screen traces end at a login page. This in turn translates into a higher percentage of questions asked about app names, email addresses, permissions to log in, etc.
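The sketch below illustrates this kind of regular-expression bucketing; the patterns shown are illustrative guesses, not the actual rules behind Table 1.

import re

CATEGORY_PATTERNS = [
    ("App name", re.compile(r"\b(app|application)\s+name\b", re.I)),
    ("Quantity number", re.compile(r"\bhow many\b", re.I)),
    ("Price", re.compile(r"\b(price|cost|how much)\b", re.I)),
    ("Email address", re.compile(r"\bemail\b", re.I)),
    ("Date time", re.compile(r"\b(when|date|time)\b", re.I)),
]

def categorize(question):
    # Return the first matching category, or "Others" if no pattern fires.
    for name, pattern in CATEGORY_PATTERNS:
        if pattern.search(question):
            return name
    return "Others"

# e.g. categorize("What is the app name?") -> "App name"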

4.3 Answer analysis

We analyze the answer annotations in two aspects: 1) How often do we need more than one bounding
box and its text to answer the question, and 2) How often do human annotators find the view hierarchy
useful to provide a minimal answer to the question.
Figure 6b illustrates the histogram of the number of bounding boxes used in each answer. About 84% of answers contain a single bounding box. Among these single-bounding-box answers, 51% use a VH leaf node directly, while 49% use a manually drawn bounding box. If we consider all answers together, 51% of answers are based only on VH leaf nodes, while 48% use only manually drawn bounding boxes. Interestingly, for 0.8% of the answers, the human annotators used a mixture of VH
leaf nodes and manually drawn bounding boxes. These cases usually happen 1) when the answer
is an enumeration of “inhomogeneous” options that are organized differently on the screen, such
as using email vs. other APIs to login, and 2) when an answer needs multiple parts to be complete,
such as a date consisting of year, month, and day scattered on the calendar UI, and a temperature
or a measurement requiring a number followed by the corresponding unit. These parts may not be
displayed in the same way, resulting in a lack of useful VH leaf nodes for some of the parts.
Human raters preferred to draw the bounding boxes in about half of the cases; this reflects that the
view hierarchy might not necessarily be a very reliable input for ScreenQA.

5 Applications: Tasks and Metrics

We design the data collection guidelines considering several real world applications. In this section,
we define metrics accordingly for training and evaluation of models.
Because we collect data from multiple raters for each question in the validation and test splits, the metrics accommodate multiple ground truths. We compute an average of the maximum metric value over all ground-truth variants for a given question as
$$\mathrm{avg}(\mathrm{metric}) = \frac{1}{N}\sum_{i=1}^{N}\max_{j}\big[\mathrm{metric}(A_i, A^{g}_{i,j})\big],$$
where $N$ is the number of questions, $A_i$ is the predicted answer for the $i$-th question, and $A^{g}_{i,j}$ is the $j$-th ground truth for the $i$-th question.
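As a direct transcription of this aggregation (a sketch, not the official evaluation code), where metric is any per-pair score such as EM or F1:

from typing import Callable, Sequence

def averaged_metric(metric: Callable[[str, str], float],
                    predictions: Sequence[str],
                    ground_truths: Sequence[Sequence[str]]) -> float:
    # predictions[i] is the predicted answer A_i; ground_truths[i] holds the variants A^g_{i,j}.
    assert len(predictions) == len(ground_truths)
    total = sum(max(metric(pred, gt) for gt in gts)
                for pred, gts in zip(predictions, ground_truths))
    return total / len(predictions)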

ScreenQA: Short (SQA-S) Given a screenshot and a question, output a short (concise) answer
to this question using the information presented on the screen. If the screenshot doesn’t contain the
answer, output “<no answer>”.
We highlight this task as one of the key capabilities our dataset enables.
Since there are many ways to represent the same information in text, we produced a list of plausible short answers to be used as ground truth here (see Section 3.5). The two metrics we propose for this task are Exact Match (EM), which verifies short answers exactly, and F1-score, which tolerates acceptable modifications in longer answers, e.g., permutations or rephrasings of quoted content. We apply SQuAD (Rajpurkar et al., 2016b) pre-processing before computing the averaged metrics.
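For concreteness, a minimal sketch of the SQuAD-style normalization and the per-pair EM and token-level F1 scores is shown below (following the formulation of Rajpurkar et al. (2016b), not the official script); it plugs directly into the averaged_metric aggregation above.

import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop articles
    return " ".join(text.split())                  # collapse whitespace

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize(prediction) == normalize(ground_truth))

def token_f1(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize(prediction).split()
    gt_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)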

ScreenQA: Long (SQA-L) Given a screenshot and a question, output a long (full-sentence) answer
to this question using the information presented on the screen. If the screenshot doesn’t contain the
answer, output “<no answer>”.
The role of this task is to enable models to output fluent answers that can be directly conveyed to a
human, e.g. by an Assistant. Oftentimes, the task resembles a summary of the elements that constitute
the answer to the given question. Summary evaluation metrics frequently capture these aspects and
therefore we recommend using ROUGE-{1,2,L} (Lin, 2004).
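One possible implementation uses the open-source rouge-score package (pip install rouge-score); the snippet below is a sketch under that assumption and reuses the averaged_metric aggregation from above.

from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge_l_f(prediction: str, reference: str) -> float:
    # RougeScorer.score(target, prediction) returns precision/recall/fmeasure per ROUGE type.
    return _scorer.score(reference, prediction)["rougeL"].fmeasure

# e.g. averaged_metric(rouge_l_f, long_answer_predictions, long_answer_ground_truths)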

ScreenQA: UI Content (SQA-UIC) Given a screenshot and a question, output a list of UI elements that contain the answer, where each element is represented by its text representation. If the screenshot doesn't contain the answer, output an empty list. Section 3 describes which UI elements correspond to a question, in which order they should be provided, and what is considered the content of a UI element.
With the exception of some icons having pre-defined textual descriptions, in most cases this content is
text within the UI element, which resembles that of OCR systems (Qin et al., 2019). The difference,
however, is that the output cannot be treated as a continuous sequence of symbols or words. It should
be evaluated as a list.
Given that the outputs are lists, we use Exact Match and F1-score as metrics. The text present in screenshots is extracted more easily than in arbitrary images, and therefore we perform element-wise matching without additional pre-processing.
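A sketch of one reasonable interpretation of the element-wise matching is shown below: EM compares the predicted list to a ground-truth list verbatim (order-sensitive, following the reading order of Section 3), while F1 scores the multiset overlap of elements; the exact treatment of ordering and duplicates is our assumption.

from collections import Counter
from typing import List

def list_exact_match(pred: List[str], gt: List[str]) -> float:
    return float(pred == gt)

def list_f1(pred: List[str], gt: List[str]) -> float:
    if not pred and not gt:
        return 1.0                 # "no answer" predicted correctly as an empty list
    common = Counter(pred) & Counter(gt)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(gt)
    return 2 * precision * recall / (precision + recall)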

ScreenQA: UI Content with Bounding Boxes (SQA-UIC-BB) Given a screenshot and a question,
output a list of UI elements that contain the answer, where each element is represented by its bounding
box and text representation.
We consider this an extension of SQA-UIC. The exact localization of relevant UI elements allows highlighting the UI elements that contain the answer to the user's question, as well as performing action automation, etc. Bounding box detection, particularly for screen content, is rarely available in existing datasets. It also stretches the model's capabilities. For this task, we recommend evaluating the bounding box detection quality using F1-score, where two bounding boxes match if their Intersection over Union (IoU) (Rezatofighi et al., 2019) score is higher than 0.1. In addition, the recognized text representation can be evaluated for each match using Exact Match and F1-score.
The rather low threshold is justified because bounding boxes in the dataset come from two sources: VH and manually drawn. VH bounding boxes tend to be big, capturing a significant amount of no-content area around the ground-truth content, while the manually drawn ones usually fit the content tightly. When different raters use different approaches to annotate the same UI element, the IoU between their boxes can be very small.
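The sketch below illustrates the IoU test and a greedy one-to-one matching for computing BBOX-F1 at the 0.1 threshold; the greedy matching strategy is our assumption, not necessarily the exact procedure behind the reported numbers.

from typing import List, Tuple

Box = Tuple[float, float, float, float]            # (x_min, y_min, x_max, y_max)

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def bbox_f1(pred: List[Box], gt: List[Box], thr: float = 0.1) -> float:
    if not pred and not gt:
        return 1.0
    if not pred or not gt:
        return 0.0
    unmatched = list(range(len(gt)))
    tp = 0
    for p in pred:                                  # each prediction greedily takes its best remaining match
        best_j, best_iou = -1, 0.0
        for j in unmatched:
            score = iou(p, gt[j])
            if score > best_iou:
                best_j, best_iou = j, score
        if best_iou > thr:
            tp += 1
            unmatched.remove(best_j)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gt)
    return 2 * precision * recall / (precision + recall)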

Table 2: Examples of answer formats for different tasks.

Question SQA-S answer SQA-L answer SQA-UIC answer


What’s the time? 10:00 a.m. The time is 10 a.m. [‘10:00’, ‘AM’]
What is the date? 05.06.2024 The date is June 5, 2024. [‘05’, ‘06’, ‘2024’]

See Table 2 for a visualization of the proposed tasks.

6 Baselines
In this section, we present baseline model performance on the tasks introduced in Section 5. The
introduced metrics capture several dimensions of model performance and what our dataset enables in
terms of downstream applications. We encourage additional research, both on improving performance on these tasks and on developing additional tasks that leverage the rich screen annotations.

6.1 Experimental Setup: Models

The previously introduced tasks use the same train, validation and test splits. For each task, we report fine-tuning quality on the described inputs and outputs. Each experiment is run individually on the two models, which we describe in further detail below.

PaliGemma 3B This recently introduced VLM leverages the SigLIP loss from PaLI-3 (Chen et al., 2023), while building on top of the Gemma model (Mesnard et al., 2024) as the language backbone. In our experiments we use the pre-trained 3B model checkpoint with 896 × 896 resolution. We follow the standard, publicly available fine-tuning process. Fine-tuning runs for 10 epochs with a learning rate of 1e-5, using the Adam optimizer with a cosine decay schedule. Both the vision and language backbones are trained during fine-tuning.
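For reference, the hyper-parameters above can be summarized as a plain configuration sketch; the key names and checkpoint identifier below are illustrative, not the exact training script.

paligemma_finetune_config = {
    "checkpoint": "paligemma-3b-pt-896",   # pre-trained 3B checkpoint at 896x896 resolution
    "epochs": 10,
    "optimizer": "adam",
    "learning_rate": 1e-5,
    "lr_schedule": "cosine_decay",
    "train_vision_backbone": True,          # both modality backbones are updated
    "train_language_backbone": True,
}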

Table 3: Performance of fine-tuned models on proposed task types. Bold is best performance.

                     SQA-S         SQA-L                 SQA-UIC       SQA-UIC-BB
                     EM    F1      R-1   R-2   R-L       EM    F1      BBOX-F1  EM    F1
ScreenAI 5B          90.7  94.6    92.6  87.4  91.9      87.0  88.7    94.2     84.0  85.7
PaliGemma 3B 896     89.4  93.2    90.9  85.3  90.1      86.1  87.8    88.8     78.8  81.2

ScreenAI 5B This best-in-class VLM specializes in UI and infographics understanding (Baechler et al., 2024) and uses a dynamic 812 × 812 resolution. Similarly, it builds on the PaLI-3 (Chen et al., 2023) architecture, but uses pix2struct patching and a different pre-training and fine-tuning dataset mixture. The model is therefore better equipped to work with UI elements. Fine-tuning runs until convergence, using a learning rate of 1e-3. The vision backbone is frozen and only the language backbone is trained during fine-tuning.
The choice of models is motivated by their strong performance on document and screen understanding tasks, as well as their ease of portability and reproducibility given their rather small parameter counts compared to today's vision-language models. Nonetheless, we report the performance of additional variants of these models, including larger ones, in Appendix E.

6.2 Experimental Results

We report our learnings in Table 3. We note the slightly higher performance of ScreenAI and attribute
it to its larger model capacity and specialized pre-training mixture that includes a richer variety of UI
elements. PaliGemma performance is however very competitive, and by specializing both modality
backbones we enable better use of the entire model capacity. The task metrics introduced in Section 5 measure models' performance in extracting relevant information to answer a question, providing fluent answers, and identifying relevant UI elements through their bounding box coordinates.
We noticed the first two capabilities correlate with a VLM’s reasoning capacity, and found zero-shot
evaluation on open source models of similar size to be much lower in performance (see Appendix E).
Comparing ScreenAI and PaliGemma, we observe that ScreenAI 5B results are 1-2% higher across metrics on SQA-S, SQA-L and SQA-UIC. The difference is even more noticeable for SQA-UIC-BB, with ScreenAI 5.4% higher in BBOX-F1 and 5.2% higher in EM. It appears that while PaliGemma is as good as or perhaps better at reasoning (counting) and at understanding UI elements (UI content prediction), it is worse at bounding box localization and at interpreting the question. This could be in part due to its newer Gemma language backbone with more general pre-training, including math datasets. For more qualitative comparisons, please see Appendix E.

7 Conclusion
We introduced ScreenQA, a rich dataset that enables training and evaluating models on question-answering tasks over screen content. We described the annotation process and the statistics of the collected dataset, which contains 85,984 question-answer pairs. In addition to answers, our dataset contains extensive annotations of the UI elements, making it possible to train or probe models for a holistic understanding of the screen, a necessary capability for high-level reasoning and automation using UI interfaces. Compared to other vision-language tasks, such as document understanding or visual question answering, the four tasks constructed on the ScreenQA dataset pose unique challenges: screens are rich in text, diverse across mobile applications, and blend text with icons and symbols. The tasks evaluate not only content quality but also UI element identification quality. Furthermore, we provided initial results on two model flavors, ScreenAI and PaliGemma, which are best in their parameter class on general document and screen understanding tasks. We encourage the community to tackle the screen content understanding challenges present in our benchmark, to enable new technologies and user experiences.

References
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos,
Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical

report. arXiv preprint arXiv:2305.10403.
Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor
Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. 2024. Screenai: A vision-language
model for ui and infographics understanding.
Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, and
Blaise Aguera y Arcas. 2021. UIBert: Learning generic multimodal representations for UI
understanding.
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and
Sağnak Taşırlar. 2023. Introducing our multimodal models.
Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer.
2022. A dataset for interactive vision language navigation with unknown command feasibility. In
European Conference on Computer Vision (ECCV).
Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil
Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong,
Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu
Soricut. 2023. Pali-3 vision language models: Smaller, faster, stronger.
Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey
Nichols, and Ranjitha Kumar. 2017. Rico: A Mobile App Dataset for Building Data-Driven Design
Applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software
and Technology, UIST ’17, pages 845–854, New York, NY, USA. Association for Computing
Machinery.
Model development contributors, Lucas Beyer*, Andreas Steiner*, André Susano Pinto*, Alexander
Kolesnikov*, Xiao Wang*, Xiaohua Zhai*, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin,
and et al. 2024. PaliGemma.
Gemini Team Google. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint
arXiv:2312.11805.
Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan
Wang, Yuxiao Dong, Ming Ding, et al. 2023. Cogagent: A visual language model for gui agents.
arXiv preprint arXiv:2312.08914.

Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and
Yoshua Bengio. 2017. Figureqa: An annotated figure dataset for visual reasoning. arXiv preprint
arXiv:1710.07300.
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham
Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. 2024. Visualwebarena: Evaluating
multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649.
Gang Li, Gilles Baechler, Manuel Tragut, and Yang Li. 2022. Learning to Denoise Raw Mobile UI
Layouts for Improving Datasets at Scale. In CHI Conference on Human Factors in Computing
Systems, pages 1–13, New Orleans LA USA. ACM.
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020a. Mapping Natural
Language Instructions to Mobile UI Action Sequences. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, pages 8198–8210, Online. Association for
Computational Linguistics.
Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. 2020b. Widget Captioning:
Generating Natural Language Description for Mobile User Interface Elements. In Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
5495–5510, Online. Association for Computational Linguistics.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summa-
rization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. Chartqa: A
benchmark for question answering about charts with visual and logical reasoning. arXiv preprint
arXiv:2203.10244.
Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar.
2022. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, pages 1697–1706.
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. 2021. Docvqa: A dataset for vqa on
document images. In Proceedings of the IEEE/CVF winter conference on applications of computer
vision, pages 2200–2209.
Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre,
Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe
Sessa, et al. 2024. Gemma: Open models based on gemini research and technology.
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. 2019. Ocr-vqa:
Visual question answering by reading text in images. In 2019 international conference on document
analysis and recognition (ICDAR), pages 947–952. IEEE.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. 2024. GPT-4 technical report.
Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, and Ying Xiao. 2019. Towards
unconstrained end-to-end text spotting. In Proceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV).
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016a. SQuAD: 100,000+
Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association
for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016b. SQuAD: 100,000+
questions for machine comprehension of text.
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese.
2019. Generalized intersection over union: A metric and a loss for bounding box regression.
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh,
and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 8317–8326.
Srinivas K. Sunkara, Maria Wang, Lijuan Liu, Gilles Baechler, Yu-Chung Hsiao, Jindong Chen,
Abhanshu Sharma, and James W. Stout. 2022. Towards Better Semantic Understanding of Mobile
Interfaces. In Proceedings of the 30th International Conference on Computational Linguistics,
Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2Words:
Automatic Mobile UI Summarization with Multimodal Learning. In The 34th Annual ACM
Symposium on User Interface Software and Technology, UIST ’21, pages 498–510, New York, NY,
USA. Association for Computing Machinery.
Jason Wu, Siyan Wang, Siman Shen, Yi-Hao Peng, Jeffrey Nichols, and Jeffrey P Bigham. 2023.
Webui: A dataset for enhancing visual ui understanding with web semantics. In Proceedings of the
2023 CHI Conference on Human Factors in Computing Systems, pages 1–14.
Xiaoxue Zang, Ying Xu, and Jindong Chen. 2021. Multimodal icon annotation for mobile applications.
In Proceedings of the 23rd International Conference on Mobile Human-Computer Interaction,
pages 1–11.

Figure 7: Question annotation interface.

A Data annotation interfaces for question and answer collection

A.1 Question annotation interface

The question annotation interface is shown in Figure 7. Question annotation was performed in a
sequential manner by multiple raters. An annotator can see all previous questions to diversify question
framing and avoid duplication. We also used the same sequential process to provide more feedback
and training to the annotators for quality improvement.

A.2 Answer annotation interface

The answer annotation interface is shown in Figure 8. Answer annotators were tasked to determine whether the question is valid and whether it is answerable from the screen context. If both are positive, the annotators need to answer the question by 1) selecting or drawing the bounding boxes of UI elements, 2) filling in the text for each selected/drawn bounding box on the right, 3) ranking them appropriately, and 4) providing the full-sentence answer to the question. The annotators were also tasked to review the question and make necessary corrections if it has grammatical errors or typos.

B ScreenQA short answers generation prompts

We describe below the prompts used with a version of the PaLM 2 model (Anil et al., 2023) to generate short answers for ScreenQA.
Textual information from the ScreenQA dataset (question, list of UI element descriptions and full-sentence answer) was the input; see B.1 and B.2 for details about the prompts used.

Figure 8: Answer annotation interface. Callouts: annotators can correct errors in the given question, but are asked not to alter the intention; added UI elements can be easily re-ranked using arrows, or removed; UI elements can be selected from available View Hierarchy nodes, or drawn manually.

B.1 For answers contained in a single UI element

List various ways to rephrase the answer. The answer should be as short as possible, without extra words from the question. Use all provided elements in each answer. Provide the output in square brackets.

Here is an example:
Question: 'What's the percentage of humidity?'
Answer elements: ['65%']
Full answer: 'The humidity is 65%.'
Rephrases: ['65%']

Here is another example:
Question: 'What is the gender?'
Answer elements: ['Male']
Full answer: 'The gender is male.'
Rephrases: ['male']

Here is another example:
Question: 'What is the status of "24 hr clock"?'
Answer elements: ['on']
Full answer: 'The status is "on".'
Rephrases: ['on', 'enabled']

Here is another example:
Question: 'What is the age limit for the profile?'
Answer elements: ['18+']
Full answer: 'The age limit of the profile is 18 years or older.'
Rephrases: ['18+', '18 and older', '18 and above', '18 years and above', '18 years old and older']

Here is another example:
Question: 'How many "Y Planners" are going to "Daytime Tour At The National Leprechaun Museum"?'
Answer elements: ['35+']
Full answer: 'At least 35 "Y Planners" are going to "Daytime Tour At The National Leprechaun Museum".'
Rephrases: ['35+', '35 or more', 'at least 35']

Here is another example:
Question: 'How many items are there in "All Streams"?'
Answer elements: ['1']
Full answer: 'There is 1 item in "All Streams".'
Rephrases: ['1', 'one']

Here is another example:
Question: 'What is the status of "Automatic Refresh"?'
Answer elements: ['off']
Full answer: 'The status of "Automatic Refresh" is "off".'
Rephrases: ['off', 'disabled']

Here is another example:
Question: 'What is the location?'
Answer elements: ['Gaylord Opryland Resort
Nashville, TN']
Full answer: 'The address is Gaylord Opryland Resort, Nashville, TN.'
Rephrases: ['Gaylord Opryland Resort, Nashville, TN', 'Gaylord Opryland Resort, Nashville, Tennessee']

Here is another example:
Question: 'What is the application name?'
Answer elements: ['Nails.Makeup.Hairstyle']
Full answer: 'The name of the application is "Nails.Makeup.Hairstyle".'
Rephrases: ['Nails.Makeup.Hairstyle']

Here is another example:
Question: 'Where is the store located?'
Answer elements: ['Boston, MA']
Full answer: 'The store is in Boston, Massachusetts.'
Rephrases: ['Boston, MA', 'Boston, Massachusetts']

Here is another example:
Question: 'Which tab is selected?'
Answer elements: ['ATP World Tour']
Full answer: 'The selected tab is "ATP World Tour".'
Rephrases: ['ATP World Tour', '"ATP World Tour" tab']

Here is another example:
Question: 'When was the "Feeling" post published?'
Answer elements: ['13 hours ago']
Full answer: 'It was published 13 hours ago.'
Rephrases: ['13 hours ago']

Here is another example:
Question: 'What is the temperature on Friday?'
Answer elements: ['0° | 3°']
Full answer: 'The temperature on Friday is a high of 3° and a low of 0°.'
Rephrases: ['0° | 3°', 'from 0° to 3°', 'from 0 degrees to 3 degrees', 'between 0° and 3°', 'between 0 degrees and 3 degrees', 'a high of 3° and a low of 0°', 'a high of 3 degrees and a low of 0 degrees']

Here is another example:
Question: 'What is the maximum and minimum temperature in Western Switzerland on Monday?'
Answer elements: ['-4° | -1°']
Full answer: 'The maximum and minimum temperatures in Western Switzerland are -1° and -4°, respectively.'
Rephrases: ['-4° | -1°', '-1° and -4°', '-1 degrees and -4 degrees', '-1° and -4° respectively', '-1 degrees and -4 degrees respectively', 'maximum -1° and minimum -4°', 'maximum -1 degrees and minimum -4 degrees', 'minimum -4° and maximum -1°', 'minimum -4 degrees and maximum -1 degrees']

Here is another example:
Question: 'What is the name of application?'
Answer elements: ['babycenter®']
Full answer: 'The application is named "babycenter".'
Rephrases: ['babycenter']

Here is another example:
Question: 'What is the app name?'
Answer elements: ['popcornflix™']
Full answer: 'The name of the application is "popcornflix".'
Rephrases: ['popcornflix']

Here is another example:
Question: 'What is the support email address?'
Answer elements: ['support@stonekick.com']
Full answer: 'The support email address is support@stonekick.com.'
Rephrases: ['support@stonekick.com']

Here is another example:
Question: 'What's the currently playing track name?'
Answer elements: ['Vibe Step']
Full answer: 'The name of the currently playing track is "Vibe Step".'
Rephrases: ['Vibe Step', '"Vibe Step" track']

Here is another example:
Question: 'What is the capital of the Aland Islands?'
Answer elements: ['Mariehamn']
Full answer: 'The capital of the Aland Islands is Mariehamn.'
Rephrases: ['Mariehamn']

Here is another example:
Question: 'What is Central European Standard Time in Albania?'
Answer elements: ['GMT+1:00']
Full answer: 'The Central European Standard Time in Albania is GMT+1:00.'
Rephrases: ['GMT+1:00', '1 hour ahead of GMT']

Here is another example:
Question: 'How many times a week does the activity need to be performed to succeed?'
Answer elements: ['3']
Full answer: 'The activity needs to be performed 3 times a week to succeed.'
Rephrases: ['3', 'three', '3 times per week', '3 times a week', 'three times per week', 'three times a week']

Here is another example:
Question: 'Who is the singer of "Dirty Sprite 3"?'
Answer elements: ['Amero Shotta']
Full answer: 'The singer of "Dirty Sprite 3" is Amero Shotta.'
Rephrases: ['Amero Shotta']

Here is another example:
Question: 'Is there any FastPass available for "Turtle Talk With Crush"?'
Answer elements: ['NO FASTPASS AVAILABLE']
Full answer: '"Turtle Talk With Crush" has no available FastPass.'
Rephrases: ['no', 'no FastPass available']

Here is another example:
Question: 'How many views in total are shown on the video "How to Play Indoor Soccer"?'
Answer elements: ['47,998']
Full answer: 'There are 47,998 shown views in total on the video "How to Play Indoor Soccer".'
Rephrases: ['47,998', '47998']

Here is another example:
Question: 'What can we search for in "Search Homes"?'
Answer elements: ['city, zip, beds, bath, price']
Full answer: 'You can search for "city, zip, beds, bath, price" in "Search Homes".'
Rephrases: ['city, zip, beds, bath, price']

Here is another example:
Question: 'Which exam was held on April 11, 2016?'
Answer elements: ['IBPS RRB PO PRE: Memory Based Set']
Full answer: 'The exam held on April 11, 2016, is "IBPS RRB PO PRE: Memory Based Set".'
Rephrases: ['IBPS RRB PO PRE: Memory Based Set', '"IBPS RRB PO PRE: Memory Based Set" exam']

Here is another example:
Question: 'What do we need to do to save more?'
Answer elements: ['join VIP']
Full answer: 'To save more, you need to join VIP.'
Rephrases: ['join VIP']

Here is another example:
Question: 'What option is shown in "Middle"?'
Answer elements: ['Pressure']
Full answer: 'The shown "Middle" option is "Pressure".'
Rephrases: ['Pressure', '"Pressure" option']

Here is another example:
Question: 'What day is April 8, 2017?'
Answer elements: ['Saturday,']
Full answer: 'April 8, 2017 is Saturday.'
Rephrases: ['Saturday']

Here is another example:
Question: 'What is calculated using the kg unit?'
Answer elements: ['Weight']
Full answer: 'The kg unit is used to calculate weight.'
Rephrases: ['Weight']

Now is your turn.
Question: 'THE QUESTION'
Answer elements: ['THE UI ELEMENT DESCRIPTION']
Full answer: 'THE FULL-SENTENCE ANSWER'
Rephrases:

B.2 For answers contained in multiple UI elements


List various ways to rephrase the answer. The answer should be as short as possible, without extra words from the question. Use all provided elements in each answer. Provide the output in square brackets.

Here is an example:
Question: 'What's the temperature?'
Answer elements: ['59', '°F']
Full answer: 'The temperature is 59 degrees Fahrenheit.'
Rephrases: ['59°F', '59 Fahrenheits', '59 degrees Fahrenheit']

Here is another example:
Question: 'What is the name?'
Answer elements: ['Jon', 'Brown']

Full answer: 'The name is Jon Brown.'
Rephrases: ['Jon Brown']

Here is another example:
Question: 'What is the rest interval duration?'
Answer elements: ['00', ':', '34']
Full answer: 'The rest interval lasts 00:34.'
Rephrases: ['00:34', '34 seconds', '0 minutes and 34 seconds', '34 minutes', '0 hours and 34 minutes']

Here is another example:
Question: 'What accounts can I use to sign up?'
Answer elements: ['Facebook', 'Twitter']
Full answer: 'You can sign up with "Facebook" and "Twitter".'
Rephrases: ['Facebook, Twitter', 'Facebook and Twitter', 'Facebook or Twitter']

Here is another example:
Question: 'What are the options available for sharing?'
Answer elements: ['Facebook', 'Enjin', 'Email', 'Fake GPS - Search location', 'Android Beam', 'Bluetooth', 'Messaging']
Full answer: 'The available sharing options are "Facebook", "Enjin", "Email", "Fake GPS - Search location", "Android Beam", "Bluetooth", and "Messaging".'
Rephrases: ['"Facebook", "Enjin", "Email", "Fake GPS - Search location", "Android Beam", "Bluetooth", "Messaging"', '"Facebook", "Enjin", "Email", "Fake GPS - Search location", "Android Beam", "Bluetooth" and "Messaging"']

Here is another example:
Question: 'What are the available questions?'
Answer elements: ['Why Michael Flynn kept his Job 17 days after the White House!', 'Caring makes girls run away?']
Full answer: 'The available questions are "Why Michael Flynn kept his Job 17 days after the White House!" and "Caring makes girls run away?".'
Rephrases: ['"Why Michael Flynn kept his Job 17 days after the White House!", "Caring makes girls run away?"', '"Why Michael Flynn kept his Job 17 days after the White House!" and "Caring makes girls run away?"']

Here is another example:
Question: 'Which tournaments are scheduled from February 20 to 26?'
Answer elements: ['Rio de Janeiro', 'Delray Beach']
Full answer: 'The tournaments scheduled from February 20 to February 26 are "Rio de Janeiro" and "Delray Beach".'
Rephrases: ['"Rio de Janeiro", "Delray Beach"', '"Rio de Janeiro" and "Delray Beach"']

Here is another example:
Question: 'How many visitors does "bobbyjones14" have?'
Answer elements: ['1', '3']
Full answer: 'The user "bobbyjones14" has 3 views and 1 view.'
Rephrases: ['1 and 3', 'one and three', '1, 3', 'one, three']

Here is another example:
Question: 'What are the available options in "Most popular"?'
Answer elements: ['United States', 'United Kingdom', 'India', 'Canada', 'Australia', 'Nepal']
Full answer: 'The available options are "United States", "United Kingdom", "India", "Canada", "Australia" and "Nepal".'
Rephrases: ['"United States", "United Kingdom", "India", "Canada", "Australia", "Nepal"', '"United States", "United Kingdom", "India", "Canada", "Australia" and "Nepal"', 'United States, United Kingdom, India, Canada, Australia, Nepal', 'United States, United Kingdom, India, Canada, Australia and Nepal']

Here is another example:
Question: 'What is the winning number for April 8, 2017?'
Answer elements: ['23', '36', '51', '53', '60', '15']
Full answer: 'The winning numbers for April 8, 2017, are 23 - 36 - 51 - 53 - 60 - 15.'
Rephrases: ['23 - 36 - 51 - 53 - 60 - 15', '23, 36, 51, 53, 60, 15']

Here is another example:
Question: 'What are the recent searches?'
Answer elements: ['Hong Kong, Hong Kong', 'SFO ⇄ ORD']
Full answer: 'The recent searches are "Hong Kong, Hong Kong" and "SFO ⇄ ORD".'
Rephrases: ['"Hong Kong, Hong Kong", "SFO ⇄ ORD"', '"Hong Kong, Hong Kong" and "SFO ⇄ ORD"']

Here is another example:
Question: 'Which two countries are playing live?'
Answer elements: ['IND', 'BAN']
Full answer: 'The two countries that are playing live are India and Bangladesh.'
Rephrases: ['India, Bangladesh', 'India and Bangladesh', 'IND, BAN', 'IND and BAN']

Now is your turn.
Question: 'THE QUESTION'
Answer elements: ['THE FIRST UI ELEMENT DESCRIPTION', ...]
Full answer: 'THE FULL-SENTENCE ANSWER'
Rephrases:
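
For illustration, a minimal Python sketch of how this template could be filled and its bracketed output parsed is shown below. It is not the tooling used for the annotation; the call_llm callable stands in for whichever instruction-following LLM is used, and the placeholder strings simply mirror the template above.

import ast
from typing import Callable

def generate_short_answers(call_llm: Callable[[str], str],
                           prompt_template: str,
                           question: str,
                           answer_elements: list[str],
                           full_answer: str) -> list[str]:
    # Fill the placeholders of the few-shot template above.
    elements = ", ".join(repr(e) for e in answer_elements)
    prompt = (prompt_template
              .replace("'THE FIRST UI ELEMENT DESCRIPTION', ...", elements)
              .replace("THE QUESTION", question)
              .replace("THE FULL-SENTENCE ANSWER", full_answer))
    # The model is expected to complete the "Rephrases:" line with a
    # bracketed list such as ['59°F', '59 degrees Fahrenheit'].
    completion = call_llm(prompt)
    return ast.literal_eval(completion.strip())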

C Data examples
Tables 4 and 5 contain a few examples from the ScreenQA dataset. Note that bounding boxes of
selected UI elements are highlighted on the screenshot, but they are not actually present in the
corresponding image from RICO (Deka et al., 2017).
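
As a rough guide to how such an example might be represented programmatically, below is a hypothetical record layout built from the first example in Table 4; the field names and the normalized coordinates are illustrative assumptions and need not match the released file format.

# Hypothetical ScreenQA-style record; field names are illustrative only.
example = {
    "screen_id": "12345",  # RICO screenshot identifier (assumed field name)
    "question": "When was the match held at Kent State?",
    "ui_elements": [
        # Normalized bounding box coordinates are assumed here.
        {"text": "09/24", "bounds": [0.12, 0.34, 0.28, 0.38]},
    ],
    "full_sentence_answers": [
        "The match was held at Kent State on September 24.",
        "The match was held on September 24.",
    ],
    "short_answers": ["09/24", "September 24", "September 24th", "9/24"],
}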

D Dataset Analysis

Figure 9: Histogram for the types of questions.

In this section we show some additional analysis of the collected data.


Table 6 shows the detailed question distribution by category. Alternatively, Figure 9 shows the distribution of question types regardless of the subject.

Table 4: Examples from ScreenQA dataset

Question: ‘When was the match held at Kent State?’


UI elements list:
• [09/24]
Full-sentence answers:
• The match was held at Kent State on September 24.
• The match was held on September 24.
Generated list of short answers:
• 09/24
• September 24
• September 24th
• 9/24

Question: ‘What is the birth date of the user?’


UI elements list:
• [1999], [January], [1]
• [January], [1], [1999]
• [1], [January], [1999]
Full-sentence answers:
• The birth date of the user is January 1, 1999.
• The user’s birth date is January 1, 1999.
• The birth date is January 1, 1999.
Generated list of short answers:
• 1/1/1999
• January 1, 1999
• 1 January 1999
• 1 January, 1999
• January 1st, 1999

Question: ‘What is the status of "Open links inside the app"?’


UI elements list:
• [off ]
Full-sentence answers:
• The status of "Open links inside the app" is "off".
• The status is "off".
Generated list of short answers:
• off
• disabled

Table 5: Examples from ScreenQA dataset

Question: ‘What is the odometer reading?’


UI elements list:
• [0 m]
Full-sentence answers:
• The odometer reading is 0m.
• The odometer reading is 0 m.
Generated list of short answers:
• 0m
• 0 meters

Question: ‘What other applications can be used?’


UI elements list:
• [Android Beam], [Bluetooth]
• [Facebook], [Android Beam], [Bluetooth]
Full-sentence answers:
• The applications that can be used are "Facebook", "Android
Beam", and "Bluetooth".
• The other applications that can be used are "Android Beam"
and "Bluetooth".
Generated list of short answers:
• Android Beam, Bluetooth
• Android Beam and Bluetooth
• "Android Beam", "Bluetooth"
• "Android Beam" and "Bluetooth"
• Facebook, Android Beam, Bluetooth
• Facebook and Android Beam and Bluetooth
• Facebook or Android Beam or Bluetooth

E Additional Model Evaluation Results

In Section 6 we report the performance of the ScreenAI 5B (Baechler et al., 2024) and PaliGemma 3B (development contributors et al., 2024) models on the ScreenQA tasks after fine-tuning; these results set the baselines for the corresponding model sizes. Here we provide additional evaluations we performed for publicly available models.

E.1 PaliGemma 3B fine-tuning

Pre-trained checkpoints for the PaliGemma 3B model are available in three resolutions: 224×224, 448×448, and 896×896. To evaluate the influence of input image resolution on model quality, we fine-tuned all three checkpoints on all four ScreenQA tasks (see Section 5), keeping all other fine-tuning parameters the same (10 epochs, learning rate 1e-5, Adam optimizer with a cosine decay schedule). The results are shown in Table 7.

Table 6: Question category distribution and examples.
Category % Examples
UI selection & config 18.1 Which option is selected? What is the selected ringtone?
Quantity number 11.7 How many unread messages? How many pictures are there in Western Europe?
App name 10.4 What is the name of the application? What is the app name?
Date time 9.4 When was “Heal the Living” released? When is happy hour?
Price 3.4 How much is the gift bonus in 3rd place? What is the price?
Name of item 3.3 What is the name of the drug? What is the name of chef?
User name 2.8 What is the name of the user? What is the username on telegram?
Duration 2.5 What is the duration of video? How long is the song?
Enum. of avail. options 2.5 Which social media options are given there? What are the options available for logging in?
Address and direction 2.4 What is the current location? What is the service zip code?
Email address 2.4 What is an email address? What is customer service email?
Person’s name 2.1 Who sang the song? What is the last name?
Signup/login 1.6 Which application can be used to sign up / login? What are the alternative choices for signing up?
Version information 1.6 What is the version number? What is the new feature in version v3.1.3?
Weather 1.5 What is the range of temperature shown on Sunday? What is the weather forecast for Sunday?
Score & value 1.4 What is height/weight of the person? What is the score?
Yes/No 1.1 Is there any travel plans? Is there any favorite?
Phone number 1.0 What is the phone number? What is the prefix for the international mobile number?
# of Stars 0.8 What is the star rating? How many stars are given to the product?
Share/sharing 0.8 Which application can be used to share? Where can I share this application?
Age 0.8 How old is ...? What is the age?
Percentage 0.7 What is the percentage of ... ? What is the brightness percentage for foreground?
Settings 0.6 What is the setting of ... ? Which settings are switched on?
Quantity amount 0.6 How much fat is there? What is the amount?
Permission 0.5 Which application is asking for permissions? What permissions are required for MyCarTracks?
# of Likes 0.5 How many likes for ... ? How many likes does ... get?
Country 0.5 What is the name of the country? Which country has the +54 code?
Distance 0.5 What is the visibility distance? How far is it from ... ?
# of Reviews 0.4 What is the number of comments on ... ? How many comments?
Website 0.3 What is the url? What’s the website address?
Gender 0.3 What is the gender? Which gender is displayed on the screen?
How to 0.3 How to start on boot? How to pronounce his name?
Currency 0.3 What is the currency? What is the currency for the price?
Unit of measurement 0.2 What is the unit of temperature? What is the unit of weight and length?
Language 0.1 Which language is used in the setting? Which language is being translated into which language?
Color 0.0 What is the UI color? What is the amount of green color?
Others 12.8 What’s the average speed? What is the user’s middle initial
What is the spending limit? Which team has 41 points?
Total 100.0

Table 7: Performance of fine-tuned PaliGemma 3B models on proposed task types. Bold is best
performance.

                      SQA-S        SQA-L                SQA-UIC       SQA-UIC-BB
                      EM    F1     R-1   R-2   R-L      EM    F1      BBOX-F1  EM    F1
PaliGemma 3B 896      89.4  93.2   90.9  85.3  90.1     86.1  87.8    88.8     78.8  81.2
PaliGemma 3B 448      88.3  92.2   91.1  85.5  90.3     86.0  87.7    89.4     79.1  81.6
PaliGemma 3B 224      77.5  83.9   88.2  81.5  87.4     74.8  76.7    84.9     67.5  69.6

The 224×224 image resolution appears to be too low to capture all the necessary details of the screen. Somewhat unexpectedly, there is almost no difference in PaliGemma 3B performance between the 448×448 and 896×896 resolutions.

E.2 Zero-shot evaluations for SQA-S task

Question answering in one form or another is one of the most common tasks for LLMs, and this is also true for VLMs. We therefore evaluated several of them in a zero-shot setting: Fuyu-8b (Bavishi et al., 2023; https://www.adept.ai/blog/fuyu-8b/), Gemini 1.5 Flash (https://deepmind.google/technologies/gemini/flash/), Gemini 1.5 Pro (Gemini Team Google, 2023; https://deepmind.google/technologies/gemini/pro/), and GPT-4o (OpenAI et al., 2024; https://openai.com/index/hello-gpt-4o/).

Table 8: Zero-shot performance of public models on SQA-S task. Bold is best performance.

                      SQA-S
                      SQuAD-EM   SQuAD-F1
Fuyu-8b               39.5       47.3
Gemini 1.5 Flash      74.6       83.2
Gemini 1.5 Pro        79.8       86.6
GPT-4o                77.8       86.6

We used the SQA-S task for this evaluation, as it is the most similar to other existing benchmarks. Table 8 lists all the results.
For each model, the prompt used in the evaluation is an instruction followed by a question. For the Fuyu-8b model, we used the instruction that the model's authors recommended in their examples for similar tasks (see https://huggingface.co/adept/fuyu-8b/discussions/28): “Answer the following DocVQA question based on the image. \n ”.
For GPT-4o, we tried a couple of different instructions, checked the model output on a sample of 10 examples from the validation split, and picked the one whose results were closest to the expected output: “Answer the question based on the screenshot only. Do not use any other sources of information. The answer should be succinct and as short as possible. If the answer is a text from the image, provide it exactly without rephrasing or augmenting. If there is no answer on the image, output "<no answer>".\n”. We do not claim that this is the best prompt for this model, nor that our search for one was very thorough.
We then re-used the same prompt for the Gemini model evaluations. While the lack of prompt engineering specifically for the Gemini models puts them at a disadvantage compared to GPT-4o, the results show that even in this setting Gemini 1.5 Pro outperforms GPT-4o by a small margin.
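
For reference, the SQuAD-EM and SQuAD-F1 numbers above follow the standard SQuAD normalization (lowercasing, dropping punctuation and articles, collapsing whitespace), with the maximum taken over the reference answers. The sketch below is written from that public definition, not from our exact evaluation code.

import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Standard SQuAD normalization: lowercase, strip punctuation and articles, fix whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, references: list[str]) -> float:
    # 1.0 if the normalized prediction matches any normalized reference.
    return max(float(normalize(prediction) == normalize(r)) for r in references)

def f1_score(prediction: str, references: list[str]) -> float:
    # Token-level F1 against the best-matching reference.
    def single_f1(pred: str, ref: str) -> float:
        pred_tokens, ref_tokens = normalize(pred).split(), normalize(ref).split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)
    return max(single_f1(prediction, r) for r in references)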

