
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Gilles Baechler∗, Srinivas Sunkara∗, Maria Wang∗, Fedir Zubach, Hassan Mansoor,
Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen∗†, Abhanshu Sharma†
Google DeepMind

arXiv:2402.04615v3 [cs.CV] 4 Jul 2024
Accepted for presentation at the International Joint Conference on Artificial Intelligence (IJCAI), 2024

Abstract

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multipage DocVQA, WebSRC, and MoTIF), and new best-in-class performance on others (ChartQA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.

1 Introduction

Infographics, such as charts, diagrams, illustrations, maps, tables, and document layouts, have long been a cornerstone of effective communication, thanks to their ability to distill complex data and ideas into simple illustrations through the arrangement of layouts and visual cues. In the digital era, mobile and desktop UIs, which share similar design principles and visual languages with infographics, facilitate human communication and human-machine interaction with rich and interactive user experiences.

Although the above observation suggests an opportunity for a unified model, infographics and UIs, because of their complexity, present a unique challenge to building a single model that can understand, reason, and interact on top of pictorial pixels. To address this challenge, we introduce ScreenAI, a Vision-Language Model (VLM) for comprehensive UI and infographics understanding, including tasks such as question-answering (QA) on infographics (charts, illustrations, maps, etc.) and element annotation, summarization, navigation, and QA on UIs. Our model combines the PaLI [Chen et al., 2023b] architecture with the flexible patching mechanism of Pix2struct [Lee et al., 2023] and handles vision tasks by recasting them as (text, image)-to-text problems. Figure 1 provides a high-level description of the model architecture and Section 2.1 describes its components in more detail.

The main contributions of this work are multifold and greatly advance the field of digital content understanding:

• We propose ScreenAI, a Vision-Language Model (VLM), as a holistic solution that focuses on understanding UIs and infographics, taking advantage of their common visual language and design sophistication.
• We introduce a textual representation for UIs, which we use to teach our model how to understand UIs during its pretraining phase.
• We take advantage of this new UI representation and Large Language Models (LLMs) to automatically generate training data at scale.
• We define pretraining and fine-tuning mixtures which cover a wide spectrum of tasks in UI and infographic understanding.
• We release three evaluation datasets for the tasks described in Section 4.2: Screen Annotation, ScreenQA Short, and Complex ScreenQA. These datasets enable the research community to utilize our textual representation and allow for a more comprehensive benchmarking of models for screen-based question answering.

These innovations position ScreenAI as the go-to VLM for any digital content understanding task, ranging from UIs to infographics, and beyond. At a modest size of 4.6 billion parameters as of January 17, 2024 (the full paper submission deadline of IJCAI-24), our model exhibits state-of-the-art (SoTA) performance on three public infographics QA benchmarks, surpassing other models that are 10x or more larger in size. On other tasks, ScreenAI exhibits best-in-class or close-to-best performance.

∗ Equal contribution. Correspondence: [email protected]
† Project leads
Figure 1: The overall architecture of our model. The model contains an image encoder followed by a multimodal encoder consuming embedded text and image features. The output of the multimodal encoder is fed to an autoregressive decoder to generate the final text output. This figure also illustrates pix2struct patching, where the grid size adapts to the aspect ratio and shape of the image.

We show in Section 5.2 that the model performance gets better as we increase its size, suggesting that there is a strong potential for further gains in performance by scaling up the model.

1.1 Related Work

We identify three categories of closely related works.

Screen-Based UI Models. Until recently, most screen understanding efforts focused on well-defined tasks with a narrow scope. Examples include the detection of icons [Zang et al., 2021] or various UI elements [Zhang et al., 2021; Sunkara et al., 2022; Li et al., 2022a], together with their structure [Wu et al., 2021]. Other notable works encompass the description of icons (widget captioning) [Li et al., 2020], screen summarization [Wang et al., 2021], and single-step navigation tasks [Wichers et al., 2018; Li et al., 2022b]. Another direction is to use LLMs to classify and describe UI elements [Gur et al., 2022], or to complete tasks [Nakano et al., 2021; Rawles et al., 2023; Deng et al., 2023].

Generalist Foundation Models. The advent of large foundation models, particularly in the multimodal domain, has led to the development of versatile and unified models. These universal models excel in a broad spectrum of image understanding tasks formulated through natural language, such as question-answering, image captioning, and object localization (e.g. UniTAB [Yang et al., 2022], OFA [Wang et al., 2022], PaLI [Chen et al., 2022; Chen et al., 2023a; Chen et al., 2023b], Flamingo [Alayrac et al., 2022], or MaMMUT [Kuo et al., 2023]). Foundational work also includes pix2seq [Chen et al., 2021a], which recasts the object detection problem as a text prediction task.

Efficient Vision-Language Models. Closer to the domain of screen and document understanding, similar transformer-based [Vaswani et al., 2017] architectures have been proposed for solving various document-understanding tasks (e.g. LayoutLMv3 [Huang et al., 2022], Donut [Kim et al., 2021], pix2struct [Lee et al., 2023], MatCha [Liu et al., 2022], UDOP [Tang et al., 2023], or Spotlight [Li and Li, 2022]). Another example is VUT [Li et al., 2021], which is made of a multimodal encoder, followed by a text decoder and a dedicated head for object detection tasks. Other approaches like UIBert [Bai et al., 2021] and DocLLM [Wang et al., 2023] perform screen- and document-understanding using additional textual data extracted from metadata like the DOM or from ancillary models like OCR.

In our paper, we introduce pre-training tasks along with a data generation schema using self-supervision and model-based annotation. Prior work with self-supervised learning tasks has typically been focused on one domain. For example, pix2struct [Lee et al., 2023] and HTLM [Aghajanyan et al., 2021] are focused on web-pages; ActionBert [He et al., 2021] and UIBert [Bai et al., 2021] are focused on mobile apps, and capture only a subset of the elements, like text, while excluding hierarchy information. Our representation, inferred from only screen or image pixels, is applicable to a wide range of domains beyond web-pages and mobile apps, including documents, infographics, etc. Compared to prior work, our model achieves superior performance on downstream tasks. We hypothesize this is due to the positive transfer of performance when using screen, document and infographics data jointly in the pre-training mixture. Given the abundance of data in each of these domains, we believe future research in this direction can result in further improvements.
2 Methodology

2.1 Architecture

Our model architecture, as shown in Figure 1, is inspired by the architecture of the PaLI family of models [Chen et al., 2022; Chen et al., 2023a; Chen et al., 2023b], which is composed of a multimodal encoder block with a vision encoder like ViT [Dosovitskiy et al., 2020] and an mT5 [Xue et al., 2020; Raffel et al., 2020] language encoder consuming image and text inputs, followed by an autoregressive decoder. The input image is transformed into a sequence of embeddings by the vision encoder and these embeddings are concatenated with the input text embeddings and fed into the mT5 language encoder. The output of this encoder is passed to the decoder to generate the text output. This generic formulation enables us to use the same model architecture to solve a variety of vision and multimodal tasks that can be recast as a text+image (input) to text (output) problem. Compared to the text input, the image embeddings constitute a significant portion of the input length to the multimodal encoder.

Model  ViT          Encoder-Decoder      #params
670M   B16 (92M)    mT5 base (583M)      675M
2B     H14 (653M)   mT5 Large (1.23B)    1.88B
5B     G14 (1.69B)  UL2-3B (2.93B)       4.62B

Table 1: Model variants and details of their parameter counts, split among vision and language models. The image encoders are based on ViT [Dosovitskiy et al., 2020] and the text encoders are based on mT5 [Xue et al., 2020] and UL2 [Tay et al., 2022].

We further extend PaLI's encoder-decoder architecture to accept various image patching patterns. The original PaLI architecture only accepts a fixed grid pattern of patches for processing the input images. However, the data we encounter in screen-related domains spans a wide variety of resolutions and aspect ratios. To have a single model that works across all screen shapes, it is necessary to use a patching strategy which can work well with images of various shapes. To this end, we borrow a technique introduced in Pix2Struct [Lee et al., 2023], which allows us to have image patches with arbitrary grid shapes based on the input image shape and a pre-defined maximum number of patches, as shown in Figure 1. This enables us to accommodate input images of various formats and aspect ratios without the need for padding or stretching the image to a fixed shape, making our model more versatile in handling both mobile (i.e. portrait) and desktop (i.e. landscape) image formats. In Section 5, we evaluate the impact of each of these modeling choices.
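To make the patching mechanism concrete, the following is a minimal sketch (our own illustration, not the authors' implementation) of how a Pix2Struct-style grid can be chosen: scale the image so that the number of patch_size x patch_size cells stays within a fixed budget while preserving the aspect ratio.

import math

def choose_grid(height, width, patch_size=16, max_patches=25):
    """Return the (rows, cols) patch grid for an image, preserving its aspect ratio."""
    # Scale factor such that the scaled image holds at most `max_patches` cells.
    scale = math.sqrt(max_patches * patch_size * patch_size / (height * width))
    rows = max(1, math.floor(scale * height / patch_size))
    cols = max(1, math.floor(scale * width / patch_size))
    return rows, cols

# With a 25-patch budget (as in Figure 1), a landscape screenshot gets a wide grid
# and a square one an even grid:
print(choose_grid(768, 1152))   # -> (4, 6)
print(choose_grid(1000, 1000))  # -> (5, 5)

An input pipeline built this way would then resize the image to rows x patch_size by cols x patch_size and flatten it into rows x cols patch embeddings; the sequence-length budgets per model size are given in Section 2.2.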
2.2 Model Configurations

We train models of 3 different sizes containing 670M, 2B and 5B parameters. For the 670M and 2B parameter models, we start from pre-trained unimodal checkpoints for the vision encoder and the encoder-decoder language models. For the 5B parameter model, we start from the multimodal pre-trained checkpoint from PaLI-3 [Chen et al., 2023a], where the ViT is trained together with the UL2 [Tay et al., 2022] based encoder-decoder language model. A breakdown of the parameter distribution among the vision and language models can be seen in Table 1.

Our patching strategy allows variable aspect ratios and input resolutions, as long as they fit within the allocated sequence length budget (2024 embeddings for the 670M model, 2916 embeddings for the 2B model, and 3364 embeddings for the 5B model). For square images, the corresponding maximum input resolution is 720 × 720 for the 670M model, 756 × 756 for the 2B model, and 812 × 812 for the 5B model.

2.3 Stages of Training

In this section, we cover the different stages of training.

Pre-Training. Starting from the checkpoints mentioned in Section 2.2, we do a first stage of training on large datasets generated from self-supervision and other models, using minimal human labeling (see Section 4.1 for a detailed description of the pre-training mixture). Contrary to the later fine-tuning stage, we train both the vision encoder and the language model. The motivation behind training the vision encoder is to incorporate the new patching strategy, and to allow the model to adapt from natural images to UI-related images. We evaluate the impact of training the vision encoder and of including LLM generated data on a variety of tasks in our ablation experiments in Section 5.

After some initial steps of pretraining, we perform additional steps with the ViT encoder frozen to further train the model while reducing the resource consumption.

Fine-Tuning. During fine-tuning, the model is trained on mixtures of tasks, most of which are labeled by human annotators. These tasks are described in detail in Section 4.2. For QA-related tasks, we start by fine-tuning the model on a combination of QA-related tasks; then, additional training is performed on each individual task separately. For all other tasks, we fine-tune the model on each one individually.

3 Automatic Data Generation

The pretraining phase of our model's development is critically dependent on access to a vast and diverse dataset. Given the impracticality of manually annotating such an extensive dataset, our strategy focuses on automatic data generation. This approach leverages specialized smaller models, each adept at generating and labeling data both efficiently and with a high degree of accuracy.

In this section, we provide a detailed account of our data generation process, particularly highlighting how we gather and automatically annotate a diverse range of screenshots for pretraining our model. This automated approach is not only efficient and scalable compared to manual annotation but also ensures a level of data diversity and complexity.

3.1 Screen Annotation

Our initial step is to equip the model with a comprehensive understanding of textual elements, various screen components, and their overall structure and hierarchy. This foundational understanding is vital for the model's ability to interpret and interact accurately with a wide range of user interfaces.

An extensive collection of screenshots has been amassed from various devices, including desktops, mobile phones, and tablets, by crawling applications and web pages [Raffel et al., 2020]. These screenshots are then annotated with detailed labels that describe the UI elements, their spatial relationships, and additional descriptive information.
Figure 2: Task generation pipeline: 1) the screens are first annotated using various models; 2) we then use an LLM (PaLM 2) to generate screen-related tasks at scale; 3) (optionally) we validate the data using another LLM or human raters.

The cornerstone of our annotation process is a layout annotator based on the DETR [Carion et al., 2020] detection model. This object detector is apt at identifying and labeling a wide range of UI elements such as IMAGE, PICTOGRAM, BUTTON, TEXT, and others. The detector and the list of UI elements are inspired by [Li et al., 2022a]. However, the models in [Li et al., 2022a] are classifiers and are provided a list of candidate bounding boxes to annotate, whereas in our case we predict the bounding boxes too.

Pictograms undergo further analysis using an icon classifier [Sunkara et al., 2022] capable of distinguishing 77 different icon types. This detailed classification is essential for interpreting the subtle communication conveyed through icons. For icons that are not covered by the classifier, as well as infographics and images, we use the PaLI image captioning model [Chen et al., 2023b]. This model generates descriptive captions that provide contextual information, aiding in the comprehensive understanding of the screen's content.

Additionally, an OCR engine extracts and annotates the textual content on screen. This step is crucial for interpreting the textual information presented in various formats on interfaces. Finally, we combine the OCR text with the previous annotations to create a detailed and holistic description of each screen. The bounding box coordinates are systematically included, providing spatial context to the elements on the screen.

Figure 3: Example of our screen schema. See Appendix B for more.

Figure 3 shows an example of the screen schema used in most of our pretraining tasks. Each schema contains:
1. The UI element names.
2. The OCR text (when applicable).
3. The element descriptions, e.g. captioning or icon names.
4. The bounding box coordinates, quantized and normalized between 0 and 999.
Parentheses are used to create a basic hierarchical structure between the elements, i.e. the children of a parent element are all put inside a parenthesis block. For ease of visualization, the bounding boxes from the screen schema have been overlaid on the original screenshot.
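As an illustration of this representation, a tree of annotated elements could be serialized into such a schema as sketched below. This is a hypothetical sketch: the UIElement structure, field names, and coordinate ordering (x_min, x_max, y_min, y_max) are our own assumptions, not the authors' code.

from dataclasses import dataclass, field
from typing import List

@dataclass
class UIElement:
    ui_class: str                       # e.g. "TEXT", "PICTOGRAM", "BUTTON"
    description: str = ""               # OCR text, caption, or icon name
    bbox: tuple = (0.0, 0.0, 0.0, 0.0)  # (x_min, x_max, y_min, y_max), normalized to [0, 1]
    children: List["UIElement"] = field(default_factory=list)

def quantize(v: float) -> int:
    """Quantize a normalized coordinate to the 0-999 range used in the schema."""
    return max(0, min(999, round(v * 999)))

def serialize(el: UIElement) -> str:
    coords = " ".join(str(quantize(v)) for v in el.bbox)
    text = f"{el.ui_class} {el.description} {coords}".replace("  ", " ")
    if el.children:
        # Children are wrapped in parentheses to express the hierarchy.
        text += " (" + " ".join(serialize(c) for c in el.children) + ")"
    return text

toolbar = UIElement("TOOLBAR", "", (0.0, 1.0, 0.03, 0.11),
                    children=[UIElement("PICTOGRAM", "arrow backward", (0.0, 0.14, 0.03, 0.11))])
print(serialize(toolbar))
# TOOLBAR 0 999 30 110 (PICTOGRAM arrow backward 0 140 30 110)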
This schema plays a central role in our data generation for pretraining tasks, offering a detailed and multifaceted representation of screen content. The schema itself also serves as a pretraining task, where the model is tasked with generating a similar schema from a provided input image. This not only enhances the model's capacity to discern and interpret various UI components, but also their relationships to one another. Additionally, the screen schema proves to be an invaluable natural language tool to interface with large language models (LLMs). By providing LLMs with a structured and detailed representation of screen content, we enable the creation of more intricate and contextually nuanced tasks.

3.2 LLMs to Generate Additional Tasks

To infuse greater diversity into our pretraining data, we leverage the capabilities of LLMs, in particular PaLM 2-S [Anil et al., 2023b], to generate Question-Answer pairs in two stages. Initially, we generate the screen schema as previously described. Subsequently, we craft a prompt incorporating the screen schema and direct the LLM to generate synthetic data. This stage is empirical and necessitates a degree of prompt engineering. However, after several iterations, we typically identify a prompt that effectively generates the desired task. Examples of such prompts are shown in Appendix C. To evaluate the quality of these generated responses, we conducted human validation on a subset of the data, ensuring that it meets a predetermined quality threshold.

This approach is described in Figure 2 and it enables us to create a variety of synthetic but realistic tasks that significantly enhance the depth and breadth of our pretraining dataset. By leveraging the natural language processing capabilities of LLMs, coupled with the structured screen schema, we can simulate a wide range of user interactions and scenarios. See Appendix D for generated examples.
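A hypothetical sketch of this second stage is shown below: the screen schema is wrapped into a prompt for the LLM (PaLM 2-S in the paper), and the response is parsed into QA pairs. The prompt wording, the "llm" callable, and the "Q:/A:" output format are illustrative assumptions; the actual prompts are shown in Appendix C.

def build_qa_prompt(screen_schema: str, num_questions: int = 3) -> str:
    return (
        "You are given a textual description of a screen, where each line lists a UI "
        "element, its text or caption, and its bounding box quantized to 0-999.\n\n"
        f"{screen_schema}\n\n"
        f"Generate {num_questions} question-answer pairs that can be answered from this "
        "screen only. Format each pair as 'Q: ...' and 'A: ...'."
    )

def generate_qa_pairs(llm, screen_schema: str):
    # llm: any text-in/text-out callable standing in for the actual LLM endpoint.
    response = llm(build_qa_prompt(screen_schema))
    pairs, question = [], None
    for line in response.splitlines():
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs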
4 Data Mixtures

We define two distinct sets of tasks for our model: an initial series of pretraining tasks and a subsequent set of fine-tuning tasks. The distinction primarily lies in two aspects:
1. Source of the Groundtruth Data: For the fine-tuning tasks, the labels are provided or verified by human raters. For the pretraining tasks, the labels are inferred using self-supervised learning methods or generated using other models.
2. Size of the Datasets: Typically, the pretraining tasks encompass a significantly larger quantity of samples, and consequently these tasks are used for training the model over a more extended series of steps.

4.1 Pretraining Mixture

Based on the methodology outlined in Section 3, we have selected the following tasks for pretraining our models. These tasks, each illustrated in Figure 4, are designed to cover a wide range of skills and scenarios, endowing our model with diverse real-world applications.

1. Screen Annotation: The model is tasked with detecting and identifying UI elements present on a screen. This includes performing OCR and image captioning to understand and interpret the textual and non-textual content. To enhance the model's contextual understanding, some text elements are intentionally masked, encouraging the model to infer information based on the surrounding context and layout.

2. Screen Question-Answering (QA): For this task, the model is asked to answer questions related to user interfaces and computer-generated images, such as infographics. After initial experiments, we identified certain gaps in performance on attributes like arithmetic, counting, and understanding images with complex infographics. To enhance the model's capabilities, we create data specifically addressing these gaps, e.g., QA involving counting, arithmetic operations, and complex data containing infographics. For these examples, we first crawl large-scale webpage and infographic images, then perform prompt tuning to generate and validate relevant questions and their answers. For charts, the mix consists of 1) synthetic data [Liu et al., 2023], 2) UniChart [Masry et al., 2023], 3) DVQA [Kafle et al., 2018], 4) TaTa [Gehrmann et al., 2022], 5) Benetech (https://www.kaggle.com/competitions/benetech-making-graphs-accessible).

3. Screen Navigation: This task involves interpreting navigation instructions (e.g., 'go back') and identifying the appropriate UI element to interact with. The expected output is the bounding box coordinates of the target element, bucketized between 0 and 999, demonstrating the model's ability to understand user intent and navigate through interfaces accurately.

4. Screen Summarization: The model is tasked to succinctly summarize the content of a screen in one or two sentences. This task assesses the model's capability to distill and caption the essence of the screen's content.

Figure 4: Samples of tasks used in our pretraining mixture: (a) Screen annotation, with masking; (b) Question-Answering; (c) Navigation; (d) Summarization. The last three have been generated using our screen annotation model, coupled with PaLM 2-S.

To ensure comprehensive training robust to aspect ratios, each task is made available across multiple formats (mobile and desktop) and includes several aspect ratios.

In addition to these screen-related tasks, our training regimen also incorporates a variety of other image and text data sources: span corruption on C4 [Xue et al., 2020], VQA CC3M [Sharma et al., 2018], WebLI Alt and OCR text [Kil et al., 2023; Chen et al., 2022] and chart-to-table translation [Liu et al., 2023]. Such datasets have been instrumental in the development of PaLI models [Chen et al., 2022; Chen et al., 2023b], which serve as the foundational architecture for our model. Their inclusion ensures that our model not only excels in screen and infographics understanding but also maintains robust language and visual processing capabilities.

A summary of all our pretraining tasks is shown in Table 2. In the mixture, datasets are weighted proportionally to their size, with a maximum allowed weight per task. Incorporating multimodal sources in our multi-task training, from language processing to visual comprehension and web content analysis, prepares our model to handle diverse scenarios effectively and enhances its overall versatility and performance.

Task Name                                      #samples
Generated Screen Annotation
  Mobile webpages                              262M
  Mobile apps                                  54M
  Mobile webpages (tall renders)               37M
Generated Screen Question-Answering
  Mobile webpages                              9.8M
  Mobile apps                                  2.0M
  Mobile webpages (tall renders)               2.3M
  Desktop webpages                             16.4M
  Infographics                                 6.3M
  ChartQA/PlotQA                               2.4M
Generated Screen Navigation
  Mobile webpages                              2.6M
  Mobile apps                                  5.9M
  Mobile webpages (tall renders)               2.3M
  Desktop webpages                             5.1M
Generated Screen Summarization
  Mobile webpages                              5.6M
  Desktop webpages                             7.6M
Other
  Tarzan [Xue et al., 2020]                    297K
  VQA CC3M [Sharma et al., 2018]               178K
  WebLI Alt and OCR text [Kil et al., 2023]    297K

Table 2: Detailed breakdown of our pretraining mixture.
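As a small sketch of one way this weighting scheme could be implemented (an assumption, not the authors' exact recipe), each task's effective sample count can be clipped before normalizing, so that the largest tasks cannot dominate the mixture; the cap value below is illustrative.

def mixture_weights(num_samples: dict, cap: float = 30e6) -> dict:
    """Sampling weights proportional to dataset size, with large tasks clipped at `cap`."""
    effective = {task: min(n, cap) for task, n in num_samples.items()}
    total = sum(effective.values())
    return {task: n / total for task, n in effective.items()}

# Using a few of the Table 2 counts as example inputs:
print(mixture_weights({
    "screen_annotation_mobile_web": 262e6,
    "screen_qa_desktop_web": 16.4e6,
    "screen_navigation_mobile_apps": 5.9e6,
}))  # the 262M task is clipped, so it no longer overwhelms the smaller tasks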
4.2 Fine-Tuning Tasks and Benchmarks

We use a variety of tasks and benchmarks during fine-tuning to estimate the quality of our model. These benchmarks are summarized in Table 3 and include the main existing screen, infographics and document understanding benchmarks. We make the following changes to task formulations: (1) we cast RefExp [Wichers et al., 2018] and Task Automation in MoTIF [Burns et al., 2022] as object detection tasks, without using candidate bounding boxes, and report accuracy at IoU=0.1 (intersection over union at threshold 0.1), considering only one predicted box; (2) for MoTIF, we report the number for the app-unseen split of the test set in Table 4, and the other split results in Table 5 of Appendix E.

We supplement the tasks mentioned above with three new benchmarks that we release:

• Screen Annotation (SA): To evaluate our model's layout annotation and spatial understanding capabilities, we create a dedicated benchmark consisting of 4.2K screenshots from the Rico dataset [Deka et al., 2017]. Each UI element has been annotated by human raters, and the annotations comprise a bounding box and a UI class from the list described in Section 3.1. We evaluate the model's predictions using object detection metrics, including F1 score, precision and recall values computed at IoU=0.1. The dataset is released at https://github.com/google-research-datasets/screen_annotation.

• ScreenQA Short (SQA Short): ScreenQA [Hsiao et al., 2022], a benchmark for screen understanding, contains UI elements and full-sentence answers as ground truth. To align the output format with other question answering tasks, we generate a new ground truth, a list of alternative short answers, for each of the questions. We use the maximum F1 score across all the candidate answers as the metric. See Figure 5 and Appendix F for more details. The dataset is released at https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#screenqa-short.

• Complex ScreenQA (Cplx SQA): To complement SQA Short, we introduce Complex ScreenQA, which includes more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios. See Figures 6 and 7 for examples and Appendix G for more details. The dataset is released at https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#complexqa.

We also provide a few additional details on how we handle Multipage DocVQA and ChartQA.

Multipage DocVQA. The standard fine-tuning task for Multipage DocVQA [Tito et al., 2023] can be transformed into a single-page DocVQA task by pairing the same question with each page of the document and choosing the answer with the highest score among all pages. In this formulation, we modify the training set by splitting a question, answer and multipage document into a positive pair (with the actual answer for the page containing the answer) and multiple negative pairs (with "no answer" for pages which do not contain the answer). The negative pairs are subsampled to avoid overfitting on not predicting an answer, and the original DocVQA task [Mathew et al., 2021] is added to the fine-tuning mixture.
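A hedged sketch of this reformulation is shown below (not the authors' code): one positive pair per question plus subsampled negative pairs for training, and a maximum over per-page scores at inference. The neg_keep_prob value and the model's (answer, score) interface are assumptions for illustration.

import random

def build_training_pairs(question, pages, answer_page_idx, answer, neg_keep_prob=0.2):
    pairs = [(question, pages[answer_page_idx], answer)]           # positive pair
    for i, page in enumerate(pages):
        if i != answer_page_idx and random.random() < neg_keep_prob:
            pairs.append((question, page, "no answer"))            # subsampled negatives
    return pairs

def answer_document(model, question, pages):
    # model(question, page) is assumed to return an (answer_text, score) tuple.
    scored = [model(question, page) for page in pages]
    return max(scored, key=lambda x: x[1])[0]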
Task Name/Benchmark                                Metric
Screen Analysis
  Screen Annotation [Ours, Sec. 4.2]               F1@IoU=0.1
  Widget Captioning [Li et al., 2020]              CIDEr
Screen Question-Answering
  ScreenQA Short [Ours, Sec. 4.2]                  SQuAD F1
  Complex ScreenQA [Ours, Sec. 4.2]                SQuAD F1
  WebSRC [Chen et al., 2021b]                      SQuAD F1
Screen Navigation
  RefExp [Bai et al., 2021]                        Acc@IoU=0.1
  MoTIF-Automation [Burns et al., 2022]            Acc@IoU=0.1
Screen Summarization
  Screen2Words [Wang et al., 2021]                 CIDEr
Infographics/Doc Visual QAs
  ChartQA [Masry et al., 2022]                     Relaxed Acc.
  DocVQA [Mathew et al., 2021]                     ANLS
  Multipage DocVQA [Tito et al., 2023]             ANLS
  InfographicVQA [Mathew et al., 2022]             ANLS
  OCR-VQA-200K [Mishra et al., 2019]               Exact Match

Table 3: Detailed breakdown of our fine-tuning mixture and their associated metrics. We assume readers are familiar with these metrics, but include descriptions and citations in Appendix A for reference.

Figure 5: Examples of questions and answers from the ScreenQA dataset, together with their LLM-generated short answers. For example, for the question "How many links and comments are there of the post 'Why Michael Flynn kept his Job 17 days after the White House!'?", the full-sentence ground truth is "There is 1 like and 1 comment on the post", and the generated short answers include "one and one", "1 and 1", "1, 1", "1 like, 1 comment", and "1 like and 1 comment".

Figure 6: Examples of mobile screens in the Complex QA dataset, with questions such as "How many days are between the departure and return dates?" (answer: "There is no answer on the screen."), "How many songs have a duration of less than 30 seconds?" (answer: 1), and "How many text size options are there?" (answer: 5).

ChartQA. Concurrent work in [Carbune et al., 2024] showed that the original fine-tuning dataset [Masry et al., 2022] is insufficiently rich for learning to solve complex reasoning tasks. There, they overcome this limitation through synthetic examples and rationales, paired with training loss changes. Here, we leverage the synthetic examples, but without modifying the training loss or incorporating rationales. We therefore maintain parity with how we fine-tune for the rest of the tasks. We report similar performance with or without OCR, hinting that the scale of the dataset contributes more than the input features. Our results otherwise further strengthen the contribution of the pre-training and architecture changes with pix2struct to better leverage the same synthetic examples without needing to rely on rationales.

5 Experiments and Results

In this section, we present the setup we used to conduct our experiments and analyze our findings. First, we compare the best performing ScreenAI model to the SoTA on a variety of screen- and infographics-related tasks. Next, we report the impact of model size on overall performance. Finally, we report the results of ablation studies to validate the design choices made for the models.

5.1 Experiments Setup

In the fine-tuning phase, we hold the ViT encoder frozen and fine-tune the language model only. We use 512 as our batch size for fine-tuning. Our text input sequence length is 128 and the output sequence length varies depending on the individual task. When fine-tuning with OCR as additional input, we increase the input sequence length accordingly. We generally find that the model converges within 30k steps. Unless specified otherwise, all experiments are run on the 5B model.

5.2 Results

Table 4 shows the performance of our models and compares them with state-of-the-art (SoTA) results on a variety of screen- and infographics-related tasks. We also include the best results for models of similar size (SoTA<5B). We report new SoTA results on MoTIF, MPDocVQA, and WebSRC, and new best-in-class results on ChartQA, DocVQA and InfographicVQA (InfoVQA). We report the same or competitive performance on Screen2Words, Widget Captioning, and OCR-VQA. We also report our results on the benchmarks introduced in Section 4.2 (Screen Annotation, Referring Expressions, ScreenQA Short and Complex ScreenQA).
Adding OCR as Additional Input. We analyze the impact of adding OCR to the model input by conducting experiments with and without OCR (we use a proprietary OCR system similar to the GCP Vision API to produce additional OCR input for each image). This is inspired by fine-tuning experiments in PaLI [Chen et al., 2023b], where across all screen- and document-related tasks, passing OCR texts as additional input improves task performance. In Table 4 we present our single task fine-tuning results using OCR data. For QA tasks, OCR input provides a boost in performance (e.g. up to 4.5% on Complex ScreenQA, MPDocVQA and InfoVQA). However, using OCR imposes a slightly larger input length and hence results in slower overall training. It also requires having OCR results available at inference time.

                SA    RefExp  SQA Short  Cplx SQA  MoTIF  Screen2Words  Widget Capt.  ChartQA  DocVQA  MPDocVQA  InfoVQA  OCR-VQA  WebSRC
SoTA            -     -       -          -         67.6a  130.7b        159.8b        80.8h    90.9h   61.8d     80.3h    77.8b    85.0f
Without OCR
  SoTA ≤ 5B     -     -       -          -         67.6a  130.7b        159.8b        77.3i    87.8c   -         57.8b    76.7b    77.8g
  ScreenAI      86.2  86.3    94.6       42.4      87.4   120.8         156.4         76.6     87.5    72.9      61.4     75.0     87.2
With OCR
  SoTA ≤ 5B     -     -       -          -         -      -             -             70.4c    89.3c   61.8d     62.4b    77.8b    85.0f
  ScreenAI      -     -       94.8       43.5      -      123.7         -             76.7     89.9    77.1      65.9     76.2     -

Table 4: Comparison of ScreenAI with various SoTA models: (a) MoTIF [Burns et al., 2022], (b) PaLI-3 [Chen et al., 2023b], (c) SmoLA PaLI-X [Wu et al., 2023a], (d) Hi-VT5 [Tito et al., 2023], (e) TILT [Powalski et al., 2021], (f) DocPrompt [Wu et al., 2023b], (g) DUBLIN [Aggarwal et al., 2023], (h) Gemini [Anil et al., 2023a], (i) ChartPaLI-5B [Carbune et al., 2024]. Bold font highlights the SoTA score, and underscore represents the best-in-class score. See Table 3 for details about the tasks and their associated metrics.

Figure 7: An example of a desktop screen in the Complex QA dataset (Question: What is the lift capacity at 35%? Answer: 1960 lb.).

Model Size. We conducted single task experiments with the following model sizes: 670M, 2B and 5B. We use benchmarks for screen tasks as well as other public tasks. In Figure 8, we observe that across all tasks, increasing the model size improves performance, and the improvements have not saturated at the largest size. We observe that for tasks that require more complex visual-text and arithmetic reasoning, e.g. InfoVQA, ChartQA, and Complex ScreenQA, the improvement between the 2B and 5B models is significantly larger than between the 670M and 2B models.

5.3 Ablation Studies

In this section, we perform ablation studies evaluating (1) the impact of pix2struct patching and (2) the use of LLM generated data for pre-training. All ablation studies are performed on the 670M parameter variant.

Impact of Pix2struct Patching. For this study, we compare a 670M model using pix2struct patching with another using fixed-grid patching. After pre-training, both models are fine-tuned on all tasks in Table 3. We split each dataset into subsets based on the image aspect ratio and compute the respective metric on these subsets. To compare fixed-grid patching to the variable pix2struct patching, we compute an aggregate score by first dividing the score of each task subset using fixed-grid patching by the score of the model using pix2struct on the entire task, and finally computing the geometric mean across all tasks. Figure 9 shows that for images with aspect ratio > 1.0 (landscape mode images), the pix2struct patching strategy is significantly better than the fixed grid patching. For portrait mode images, the trend is reversed, but fixed grid patching is only marginally better. Given that we want the ScreenAI model to be used across images of different aspect ratios, we choose to use pix2struct patching.
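The aggregate score described above can be computed as follows; this is a small sketch, and the task names and score values in the example call are illustrative only, not results from the paper.

import math

def aggregate_score(ablated_scores, baseline_scores):
    """Geometric mean of per-task score ratios (ablated model / baseline model)."""
    ratios = [ablated_scores[task] / baseline_scores[task] for task in baseline_scores]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Illustrative values: two tasks with ratios of about 0.99 and 0.81 aggregate to about 0.90.
print(aggregate_score({"screen_qa": 84.5, "chart_qa": 44.8},
                      {"screen_qa": 85.4, "chart_qa": 55.3}))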
Impact of LLM Generated Data. For this experiment, we compare a 670M ScreenAI model pre-trained using all the datasets mentioned in Section 4.1 against a model pre-trained on a mixture excluding any LLM generated pre-training data. After pre-training, both models are fine-tuned on all tasks mentioned in Table 3 and an aggregate score is computed. We observe that adding LLM generated data to the mixture improves the aggregate score by 4.6 percentage points.

6 Conclusions

In this work, we introduce the ScreenAI model along with a new unified schema for representing complex data and visual information, compatible with infographics, document images, and various UIs. This unified representation enables the design of a mixture of self-supervised learning tasks, leveraging data from all these domains. We show that training on this mixture results in a positive transfer to screen-related tasks as well as infographics- and document-related tasks.
Figure 8: Performance of the different model sizes (670M, 2B, 5B) on fine-tuning tasks. The metrics improve consistently as the model size increases.

Figure 9: Ablation study for Pix2Struct vs. fixed-grid patching; the numbers represent the aggregated scores across all fine-tuned tasks. For aspect ratio > 1.0, using Pix2Struct patching significantly outperforms fixed-grid patching, whereas for aspect ratio < 1.0, fixed-grid patching outperforms Pix2Struct by a smaller margin.

We also illustrate the impact of data generation using LLMs and justify our model design choices with ablation studies. We apply these techniques to train a model that performs competitively and achieves SoTA on a number of public benchmarks. While our model is best-in-class, we note that, on some tasks, further research is needed to bridge the gap with models like GPT-4 and Gemini, which are orders of magnitude larger. To encourage further research, we release a dataset with this unified representation, as well as two other datasets, to enable more comprehensive benchmarking of models on screen-related tasks.

Acknowledgements

We would like to thank team alumni Yo Hsiao and Zixian Ma for their contribution to the project; Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li, Yang Li, Radu Soricut and Tania Bedrax-Weiss for their insightful feedback and fruitful discussions; Rahul Aralikatte, Hao Cheng and Daniel Kim for their wholehearted and tireless support in data preparation; and Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their vision and support in leadership.

Contribution Statement

First Authors with Equal Contributions: Gilles Baechler, Srinivas Sunkara, Maria Wang, Jindong Chen.
Project Leads: Jindong Chen, Abhanshu Sharma

References

[Aggarwal et al., 2023] Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan, Qiang Liu, Monojit Choudhury, Subhojit Som, Vishrav Chaudhary, and Saurabh Tiwary. DUBLIN: Document understanding by language-image network. arXiv preprint arXiv:2305.14218, 2023.
[Aghajanyan et al., 2021] Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, and Luke Zettlemoyer. HTLM: Hyper-text pre-training and prompting of language models, 2021.
[Alayrac et al., 2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[Anil et al., 2023a] Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[Anil et al., 2023b] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
[Bai et al., 2021] Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, and Blaise Aguera y Arcas. UIBert: Learning generic multimodal representations for UI understanding, 2021.
[Burns et al., 2022] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer. A dataset for interactive vision language navigation with unknown command feasibility. In European Conference on Computer Vision (ECCV), 2022.
[Carbune et al., 2024] Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, and Abhanshu Sharma. Chart-based reasoning: Transferring capabilities from LLMs to VLMs, 2024.
[Carion et al., 2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[Chen et al., 2021a] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852, 2021.
[Chen et al., 2021b] Xingyu Chen, Zihan Zhao, Lu Chen, Danyang Zhang, Jiabao Ji, Ao Luo, Yuxuan Xiong, and Kai Yu. WebSRC: A dataset for web-based structural reading comprehension, 2021.
[Chen et al., 2022] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
[Chen et al., 2023a] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023.
[Chen et al., 2023b] Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. PaLI-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023.
[Deka et al., 2017] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845–854, 2017.
[Deng et al., 2023] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023.
[Dosovitskiy et al., 2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[Gehrmann et al., 2022] Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, and Clara Rivera. TaTa: A multilingual table-to-text dataset for African languages, 2022.
[Gur et al., 2022] Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding HTML with large language models. arXiv preprint arXiv:2210.03945, 2022.
[He et al., 2021] Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen, and Blaise Agüera y Arcas. ActionBert: Leveraging user actions for semantic understanding of user interfaces, 2021.
[Hsiao et al., 2022] Yu-Chung Hsiao, Fedir Zubach, Maria Wang, et al. ScreenQA: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199, 2022.
[Huang et al., 2022] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022.
[Kafle et al., 2018] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering, 2018.
[Kil et al., 2023] Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, and Radu Soricut. PreSTU: Pre-training for scene-text understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15270–15280, 2023.
[Kim et al., 2021] Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Donut: Document understanding transformer without OCR. arXiv preprint arXiv:2111.15664, 2021.
[Kuo et al., 2023] Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, et al. MaMMUT: A simple architecture for joint learning for multimodal tasks. arXiv preprint arXiv:2303.16839, 2023.
[Lee et al., 2023] Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
[Li and Li, 2022] Gang Li and Yang Li. Spotlight: Mobile UI understanding using vision-language models with a focus. arXiv preprint arXiv:2209.14927, 2022.
[Li et al., 2020] Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget captioning: Generating natural language description for mobile user interface elements, 2020.
[Li et al., 2021] Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, and Alexey Gritsenko. VUT: Versatile UI transformer for multi-modal multi-task user interface modeling. arXiv preprint arXiv:2112.05692, 2021.
[Li et al., 2022a] Gang Li, Gilles Baechler, Manuel Tragut, and Yang Li. Learning to denoise raw mobile UI layouts for improving datasets at scale. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–13, 2022.
[Li et al., 2022b] Tao Li, Gang Li, Jingjie Zheng, Purple Wang, and Yang Li. MUG: Interactive multimodal grounding on user interfaces, 2022.
[Liu et al., 2022] Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. MatCha: Enhancing visual language pretraining with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662, 2022.
[Liu et al., 2023] Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation, 2023.
[Masry et al., 2022] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
[Masry et al., 2023] Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning, 2023.
[Mathew et al., 2021] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A dataset for VQA on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.
[Mathew et al., 2022] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.
[Methani et al., 2020] Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. PlotQA: Reasoning over scientific plots, 2020.
[Mishra et al., 2019] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019.
[Nakano et al., 2021] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
[Powalski et al., 2021] Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, and Gabriela Pałka. Going full-TILT boogie on document understanding with text-image-layout transformer, 2021.
[Raffel et al., 2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[Rajpurkar et al., 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text, 2016.
[Rawles et al., 2023] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for Android device control. arXiv preprint arXiv:2307.10088, 2023.
[Sharma et al., 2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
[Sunkara et al., 2022] Srinivas Sunkara, Maria Wang, Lijuan Liu, Gilles Baechler, Yu-Chung Hsiao, Abhanshu Sharma, James Stout, et al. Towards better semantic understanding of mobile interfaces. arXiv preprint arXiv:2210.02663, 2022.
[Tang et al., 2023] Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.
[Tay et al., 2022] Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, 2022.
[Tito et al., 2023] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Hierarchical multimodal transformers for multipage DocVQA. Pattern Recognition, 144:109834, 2023.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[Vedantam et al., 2015] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation, 2015.
[Wang et al., 2021] Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. Screen2Words: Automatic mobile UI summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510, 2021.
[Wang et al., 2022] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
[Wang et al., 2023] Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. DocLLM: A layout-aware generative language model for multimodal document understanding. arXiv preprint arXiv:2401.00908, 2023.
[Wichers et al., 2018] Nevan Wichers, Dilek Hakkani-Tür, and Jindong Chen. Resolving referring expressions in images with labeled elements. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 800–806. IEEE, 2018.
[Wu et al., 2021] Jason Wu, Xiaoyi Zhang, Jeff Nichols, and Jeffrey P Bigham. Screen parsing: Towards reverse engineering of UI models from screenshots. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 470–483, 2021.
[Wu et al., 2023a] Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, and Radu Soricut. Omni-SMoLA: Boosting generalist multimodal models with soft mixture of low-rank experts, 2023.
[Wu et al., 2023b] Sijin Wu, Dan Zhang, Teng Hu, and Shikun Feng. DocPrompt: Large-scale continue pretrain for zero-shot and few-shot document question answering, 2023.
[Xue et al., 2020] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934, 2020.
[Yang et al., 2022] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. UniTAB: Unifying text and box outputs for grounded vision-language modeling. In European Conference on Computer Vision, pages 521–539. Springer, 2022.
[Zang et al., 2021] Xiaoxue Zang, Ying Xu, and Jindong Chen. Multimodal icon annotation for mobile applications. In Proceedings of the 23rd International Conference on Mobile Human-Computer Interaction, pages 1–11, 2021.
[Zhang et al., 2021] Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, et al. Screen recognition: Creating accessibility metadata for mobile applications from pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.
els. In Proceedings of the 2021 CHI Conference on Human Fac- TOOLBAR 0 998 31 113 (
PICTOGRAM arrow backward 0 135 32 112
tors in Computing Systems, pages 1–15, 2021. TEXT Sacramento, CA 179 549 57 90)
Appendix

A Definitions of Metrics

We describe below the two categories of metrics that we use in our fine-tuning benchmarks.

Metrics for object detection tasks. For tasks involving the prediction of bounding boxes (UI elements), we use the standard object detection approach, which consists of first matching the predicted bounding boxes with the ground truth, and then computing various metrics from these matches. We set the Intersection over Union (IoU) threshold to 0.1, and we perform the matching per class, not globally. The metrics used in this paper are listed below; a minimal sketch of the matching and scoring is given after the list.

1. F1@IoU=0.1 - F1 score (harmonic mean of the precision and recall) at IoU threshold 0.1.

2. Acc@IoU=0.1 - Top-1 accuracy at IoU threshold 0.1.
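The sketch below illustrates this per-class matching and the F1@IoU=0.1 computation. It is a minimal illustration only: the greedy matching order, the box representation, and the function names are assumptions, since the text above specifies only the IoU threshold and the per-class matching.

# Minimal sketch of per-class box matching and F1@IoU=0.1.
# Greedy matching by IoU is an assumption; the paper only specifies the
# 0.1 IoU threshold and that matching is performed per class.

def iou(a, b):
    # Boxes as (x_min, y_min, x_max, y_max).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def f1_at_iou(predictions, ground_truths, threshold=0.1):
    # predictions, ground_truths: dict mapping class name -> list of boxes.
    tp = fp = fn = 0
    for cls in set(predictions) | set(ground_truths):
        preds = predictions.get(cls, [])
        gts = ground_truths.get(cls, [])
        matched = set()
        for p in preds:
            # Match each prediction to the best still-unmatched ground truth.
            best_j, best_iou = -1, threshold
            for j, g in enumerate(gts):
                if j not in matched and iou(p, g) >= best_iou:
                    best_j, best_iou = j, iou(p, g)
            if best_j >= 0:
                matched.add(best_j)
                tp += 1
            else:
                fp += 1
        fn += len(gts) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)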
Metrics for benchmarks where output is plain text. For all other tasks, we use the following metrics (a sketch of two of them is given after the list):

1. CIDEr - Consensus-based Image Description Evaluation [Vedantam et al., 2015].

2. SQuAD F1 - F1 score (harmonic mean of the precision and recall) after applying SQuAD (Stanford Question Answering Dataset) [Rajpurkar et al., 2016] text pre-processing.

3. Relaxed accuracy [Methani et al., 2020].

4. ANLS - Average Normalized Levenshtein Similarity [Mathew et al., 2021].

5. Exact Match (EM) - See https://github.com/huggingface/datasets/tree/main/metrics/exact_match#readme for the definition of Exact Match.
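For reference, the sketch below spells out two of these metrics, ANLS and relaxed accuracy. The 0.5 similarity cut-off for ANLS and the 5% numerical tolerance for relaxed accuracy follow the usual conventions of the cited papers and should be treated as assumptions here; the remaining metrics are used through their standard reference implementations.

# Hedged sketch of ANLS and relaxed accuracy. The 0.5 threshold (ANLS)
# and the 5% tolerance (relaxed accuracy) are the conventional values;
# see the cited papers for the authoritative definitions.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(prediction: str, references: list, tau: float = 0.5) -> float:
    best = 0.0
    for ref in references:
        p, r = prediction.lower().strip(), ref.lower().strip()
        nl = levenshtein(p, r) / max(len(p), len(r), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best

def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    try:
        p, t = float(prediction), float(target)
        return abs(p - t) <= tolerance * abs(t) if t != 0 else p == t
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()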
B Screen Schema Examples

Figure 10 shows examples of the screen schema used in most of our pretraining tasks. Each schema contains:

1. The UI element names.

2. The OCR text (when applicable).

3. The element descriptions (e.g. captioning, or the icon name).

4. The bounding box coordinates, quantized and normalized between 0 and 999.

Parentheses are used to create a basic hierarchical structure between the elements, i.e. the children of a parent element are all put inside a parenthesis block. For ease of visualization, the bounding boxes from the screen schema have been overlaid on the original screenshot.

[Figure 10: Examples of our screen schema.]
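To make the format concrete, the sketch below serializes a toy element tree into a schema-like string. The field names, helper functions, and the (x_min, x_max, y_min, y_max) coordinate ordering are assumptions inferred from the Figure 10 examples rather than a formal specification of the schema.

# Illustrative serialization of a UI element tree into the screen schema
# format described above. The coordinate ordering and field names are
# inferred from the Figure 10 examples, not an official specification.
from dataclasses import dataclass, field

@dataclass
class Element:
    ui_class: str                       # e.g. TEXT, BUTTON, PICTOGRAM, LIST_ITEM
    text: str = ""                      # OCR text or description, if any
    box: tuple = (0.0, 0.0, 0.0, 0.0)   # normalized (x_min, x_max, y_min, y_max)
    children: list = field(default_factory=list)

def quantize(v: float) -> int:
    # Map a normalized coordinate in [0, 1] to the integer range [0, 999].
    return max(0, min(999, round(v * 999)))

def to_schema(el: Element) -> str:
    parts = [el.ui_class]
    if el.text:
        parts.append(el.text)
    parts.extend(str(quantize(v)) for v in el.box)
    out = " ".join(parts)
    if el.children:
        # Children are wrapped in parentheses to express the hierarchy.
        out += " (" + " ".join(to_schema(c) for c in el.children) + ")"
    return out

toolbar = Element("TOOLBAR", "", (0.0, 1.0, 0.03, 0.11),
                  [Element("PICTOGRAM", "arrow backward", (0.0, 0.14, 0.03, 0.11))])
print(to_schema(toolbar))
# TOOLBAR 0 999 30 110 (PICTOGRAM arrow backward 0 140 30 110)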
C Prompts For LLM Generated Content

In this section, we present some of the prompts used as input to LLMs like PaLM 2-S [Anil et al., 2023b] to generate data for screen question answering, screen navigation, and screen summarization tasks. In addition to the prompt, we also pass as input to the LLM the screen annotation schema described in Appendix B.
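As a rough illustration of how these prompts are combined with a screen schema, the sketch below fills the {THE SCREEN SCHEMA} placeholder of the question-answering prompt (Appendix C.1) and parses the model reply. The generate callable stands in for whatever text-generation API serves the LLM, and the JSON post-processing is an assumption made for illustration, not a description of the actual pipeline.

# Hedged sketch: instantiate a data-generation prompt with a screen schema
# and keep only well-formed question/answer pairs from the LLM reply.
import json
from typing import Callable

def generate_qa_pairs(prompt_template: str, screen_schema: str,
                      generate: Callable[[str], str]) -> list:
    prompt = prompt_template.replace("{THE SCREEN SCHEMA}", screen_schema)
    raw = generate(prompt)  # any text-generation call, e.g. a PaLM 2-S endpoint
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed outputs are simply dropped in this sketch
    if not isinstance(data, dict):
        return []
    return [qa for qa in data.get("questions", [])
            if isinstance(qa, dict) and "question" in qa and "answer" in qa]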
C.1 Screen Question Answering

You only speak JSON. Do not write text that isn't JSON.
You are given the following mobile screenshot, described in words. Can you generate 5 questions regarding the content of the screenshot as well as the corresponding short answers to them? The answer should be as short as possible, containing only the necessary information. Your answer should be structured as follows:
questions: [
{{question: the question,
answer: the answer
}}, ...]
{THE SCREEN SCHEMA}

C.2 Screen Navigation

You only speak JSON. Do not write text that isn't JSON. You are given a mobile screenshot, described in words. Each UI element has a class, which is expressed in capital letter. The class is sometimes followed by a description, and then 4 numbers between 0 and 999 represent the quantized coordinates of each element.
Generate {num_samples} single-step navigation instructions and their corresponding answers based on the screenshot. Each answer should always start with `click`, followed by the coordinates of the element to click on, e.g. `click 0 137 31 113`.
Be creative with the questions, do not always use the same wording, refer to the UI elements only indirectly, and use imperative tense. Your answer should be structured as in the example below:
"questions": [
{{"question": "the question",
"answer": "click 0 137 31 113"
}},
...
]
{THE SCREEN SCHEMA}

C.3 Screen Summarization

You only speak JSON. Do not write text that isn't JSON.
You are given the following mobile screenshot, described in words.
Generate a summary of the screenshot in 2-3 sentences. Do not focus on specifically naming the various UI elements, but instead, focus on the content. Your answer should be structured as follows:
"summary": the screen summary
{THE SCREEN SCHEMA}
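The navigation answers produced with the prompt in C.2 follow a "click x0 x1 y0 y1" convention, with coordinates quantized to the 0-999 range like the schema itself. The sketch below shows one way such an answer could be converted back into a tap location; the (x_min, x_max, y_min, y_max) ordering is inferred from the schema examples (e.g. a full-width toolbar "0 998 31 113") and is an assumption, not a documented format.

# Hedged sketch: turn a generated navigation answer into a tap point.
def parse_click(answer: str, screen_w: int, screen_h: int):
    tokens = answer.strip().split()
    if len(tokens) != 5 or tokens[0] != "click":
        raise ValueError(f"Unexpected navigation answer: {answer!r}")
    x0, x1, y0, y1 = (int(t) for t in tokens[1:])
    # De-quantize from [0, 999] to pixels and tap the center of the box.
    cx = (x0 + x1) / 2 / 999 * screen_w
    cy = (y0 + y1) / 2 / 999 * screen_h
    return cx, cy

print(parse_click("click 0 137 31 113", screen_w=1080, screen_h=1920))
# approximately (74.1, 138.4): a tap near the top-left corner of the screen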
D Screen Navigation Generated Examples

We present a few examples for the Screen Navigation task generated using LLMs in Figure 11. More details about the data generation process can be found in Section 3.

[Figure 11: Examples of Screen Navigation data generated using an LLM. The target bounding box is highlighted in red.]

E MoTIF Evaluation Results

In this section, we present the ScreenAI model metrics on the different splits of the MoTIF [Burns et al., 2022] task automation dataset. The metrics breakdown can be seen in Table 5.

Model      App Seen   App Unseen
Baseline   66.3       67.6
ScreenAI   87.7       87.8

Table 5: Metrics on different splits of MoTIF [Burns et al., 2022] Task Automation.

F ScreenQA Short Answers Generation

We describe below the motivation behind producing a list instead of a single short answer as a new ground truth for the ScreenQA [Hsiao et al., 2022] dataset, as well as the generation details.

There are many ways to represent the same information. For example, "25.01.2023", "25th of January 2023" and "January 25, 2023" represent the same date, and the model should not be penalized for choosing one representation over the others. A list of various representations of the same factual answer allows this.

A variant of PaLM 2-S [Anil et al., 2023b] was used to generate this list of short answers in a few-shot setting. We give as input to the LLM text information from the ScreenQA dataset (question, list of UI element descriptions, and full-sentence answer) in addition to the prompts described in Appendix F.1 and F.2. The generated lists were then verified by simple heuristics and eyeballing of random samples. See examples of questions and answers from the ScreenQA task, together with their LLM-generated short answers, in Figure 12.

[Figure 12: Examples of questions and answers from the ScreenQA dataset, together with their LLM-generated short answers.]

F.1 For answers contained in a single UI element

For each entry in the ScreenQA dataset where there is only one UI element in the ground truth, we use the following prompt with the PaLM 2-S model [Anil et al., 2023b] to generate a list of short answers from the question, list of elements, and the full-sentence answer:

List various ways to rephrase the answer. The answer should be as short as possible, without extra words from the question. Use all provided elements in each answer. Provide the output in square brackets.

Here is an example:
Question: 'What's the percentage of humidity?'
Answer elements: ['65%']
Full answer: 'The humidity is 65%.'
Rephrases: ['65%']

Here is another example:
Question: 'What is the gender?'
Answer elements: ['Male']
Full answer: 'The gender is male.'
Rephrases: ['male']
Here is another example:
Question: 'What is the status of "24 hr clock"?'
Answer elements: ['on']
Full answer: 'The status is "on".'
Rephrases: ['on', 'enabled']

[...]

Now is your turn.
Question: {THE QUESTION}
Answer elements: {THE UI ELEMENT DESCRIPTION}
Full answer: {THE FULL-SENTENCE ANSWER}
Rephrases:

F.2 For answers contained in multiple UI elements

For each entry in the ScreenQA dataset where there is more than one UI element in the ground truth, we use the following prompt with the PaLM 2-S model [Anil et al., 2023b] to generate a list of short answers from the question, list of UI elements, and full-sentence answer:

List various ways to rephrase the answer. The answer should be as short as possible, without extra words from the question. Use all provided elements in each answer. Provide the output in square brackets.

Here is an example:
Question: 'What's the temperature?'
Answer elements: ['59', '°F']
Full answer: 'The temperature is 59 degrees Fahrenheit.'
Rephrases: ['59°F', '59 Fahrenheits', '59 degrees Fahrenheit']

Here is another example:
Question: 'What is the name?'
Answer elements: ['Jon', 'Brown']
Full answer: 'The name is Jon Brown.'
Rephrases: ['Jon Brown']

Here is another example:
Question: 'What is the rest interval duration?'
Answer elements: ['00', ':', '34']
Full answer: 'The rest interval lasts 00:34.'
Rephrases: ['00:34', '34 seconds', '0 minutes and 34 seconds', '34 minutes', '0 hours and 34 minutes']

[...]

Now is your turn.
Question: {THE QUESTION}
Answer elements: {THE FIRST UI ELEMENT DESCRIPTION, ...}
Full answer: {THE FULL-SENTENCE ANSWER}
Rephrases:
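Once these lists are generated, a model prediction can be scored against every variant and the best match kept, so that any acceptable phrasing receives full credit. The sketch below shows one plausible way to do this with a normalized exact match; it is not necessarily the exact scoring protocol used in our benchmarks.

# Hedged sketch: score a prediction against all LLM-generated short-answer
# variants and keep the best score. Normalization and the underlying metric
# (exact match) are illustrative assumptions.
import string

def normalize(text: str) -> str:
    text = text.lower().strip()
    return "".join(ch for ch in text if ch not in string.punctuation)

def best_exact_match(prediction: str, short_answers: list) -> float:
    if not short_answers:
        return 0.0
    return max(1.0 if normalize(prediction) == normalize(a) else 0.0
               for a in short_answers)

print(best_exact_match("59 Degrees Fahrenheit.",
                       ["59°F", "59 Fahrenheits", "59 degrees Fahrenheit"]))  # 1.0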
G Complex Question Answering Datasets

The Complex QA datasets contain machine-generated questions using LLMs like PaLM 2-S [Anil et al., 2023b] based on the Screen Annotation output from the best ScreenAI VLM. For each dataset, the prompts are chosen to target certain types of questions. With this approach, we generate large-scale datasets for desktop, mobile, mobile with different aspect ratios, and infographics screens. These datasets are used both for pre-training and evaluation. We add an additional step of human rater verification for the evaluation data. Figure 13 and Figure 14 show a few examples of LLM-generated QA data that was verified by humans.

We distinguish three different subsets, each focusing on solving the various challenges we identified with this task:

• Desktop QA and Long Webpage QA: Datasets on desktop screens and long (viewport height) webpages, respectively. The aspect ratio and size of the input images is very different compared to other QA datasets.

• Complex QA datasets: Datasets mainly focused on counting, arithmetic, and comparison operations requiring information from more than one part of the screen.
  – Complex QA: Mobile app screens.
  – Desktop Complex QA: Desktop screens.
  – Long Webpage Complex QA: Long webpages.

• Non Answerable QA: Dataset focused on measuring the ability of the model to know when a question cannot be answered from the given screen.

H New Benchmarks Repositories

We release three evaluation datasets for the tasks described in Section 4.2:

• Screen Annotation (SA): https://github.com/google-research-datasets/screen_annotation

• ScreenQA Short (SQA Short): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#screenqa-short

• Complex ScreenQA (Cplx SQA): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#complexqa
[Figure 13: Examples of mobile Complex QA evaluation examples.]

[Figure 14: Examples of desktop Complex QA evaluation examples.]
