ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler∗, Srinivas Sunkara∗, Maria Wang∗, Fedir Zubach, Hassan Mansoor,
Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen∗†, Abhanshu Sharma†
Google DeepMind
Accepted for presentation at the International Joint Conference on Artificial Intelligence (IJCAI), 2024
Figure 1: The overall architecture of our model. The model contains an image encoder followed by a multimodal encoder consuming
embedded text and image features. The output of the multimodal encoder is fed to an autoregressive decoder to generate the final text output.
This figure also illustrates pix2struct patching, where the grid size adapts to the aspect ratio and shape of the image.
or close-to-best performance. We show in Section 5.2 that the model performance gets better as we increase its size, suggesting that there is a strong potential for further gains in performance by scaling up the model.

1.1 Related Work
We identify three categories of closely related works.

Screen-Based UI Models. Until recently, most screen understanding efforts focused on well-defined tasks with a narrow scope. Examples include the detection of icons [Zang et al., 2021] or various UI elements [Zhang et al., 2021; Sunkara et al., 2022; Li et al., 2022a], together with their structure [Wu et al., 2021]. Other notable works encompass the description of icons (widget captioning) [Li et al., 2020], screen summarization [Wang et al., 2021], and single-step navigation tasks [Wichers et al., 2018; Li et al., 2022b]. Another direction is to use LLMs to classify and describe UI elements [Gur et al., 2022], or to complete tasks [Nakano et al., 2021; Rawles et al., 2023; Deng et al., 2023].

Generalist Foundation Models. The advent of large foundation models, particularly in the multimodal domain, has led to the development of versatile and unified models. These universal models excel in a broad spectrum of image understanding tasks formulated through natural language, such as question answering, image captioning, and object localization (e.g. UniTAB [Yang et al., 2022], OFA [Wang et al., 2022], PaLI [Chen et al., 2022; Chen et al., 2023a; Chen et al., 2023b], Flamingo [Alayrac et al., 2022], or MaMMUT [Kuo et al., 2023]). Foundational work also includes pix2seq [Chen et al., 2021a], which recasts the object detection problem as a text prediction task.

Efficient Vision-Language Models. Closer to the domain of screen and document understanding, similar transformer-based [Vaswani et al., 2017] architectures have been proposed for solving various document-understanding tasks (e.g. LayoutLMv3 [Huang et al., 2022], Donut [Kim et al., 2021], Pix2Struct [Lee et al., 2023], MatCha [Liu et al., 2022], UDOP [Tang et al., 2023], or Spotlight [Li and Li, 2022]). Another example is VUT [Li et al., 2021], which is made of a multimodal encoder, followed by a text decoder and a dedicated head for object detection tasks.

Other approaches like UIBert [Bai et al., 2021] and DocLLM [Wang et al., 2023] perform screen and document understanding using additional textual data extracted from metadata such as the DOM, or from ancillary models like OCR.

In our paper, we introduce pre-training tasks along with a data generation schema using self-supervision and model-based annotation. Prior work with self-supervised learning tasks has typically focused on a single domain: for example, Pix2Struct [Lee et al., 2023] and HTLM [Aghajanyan et al., 2021] focus on web pages, while ActionBert [He et al., 2021] and UIBert [Bai et al., 2021] focus on mobile apps and capture only a subset of the elements (such as text), excluding hierarchy information. Our representation, inferred from screen or image pixels only, is applicable to a wide range of domains beyond web pages and mobile apps, including documents, infographics, etc. Compared to prior work, our model achieves superior performance on downstream tasks. We hypothesize that this is due to positive transfer when screen, document, and infographics data are used jointly in the pre-training mixture. Given the abundance of data in each of these domains, we believe future research in this direction can result in further improvements.

2 Methodology

2.1 Architecture
Our model architecture, as shown in Figure 1, is inspired by the architecture of the PaLI family of models [Chen et al., 2022;
Chen et al., 2023a; Chen et al., 2023b], which is composed of a multimodal encoder block with a vision encoder like ViT [Dosovitskiy et al., 2020] and an mT5 [Xue et al., 2020; Raffel et al., 2020] language encoder consuming image and text inputs, followed by an autoregressive decoder. The input image is transformed into a sequence of embeddings by the vision encoder, and these embeddings are concatenated with the input text embeddings and fed into the mT5 language encoder. The output of this encoder is passed to the decoder to generate the text output. This generic formulation enables us to use the same model architecture to solve a variety of vision and multimodal tasks that can be recast as a text+image (input) to text (output) problem. Compared to the text input, the image embeddings constitute a significant portion of the input length to the multimodal encoder.
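As an illustration only (not the actual ScreenAI implementation), the text+image-to-text flow can be sketched with generic PyTorch transformer blocks; all module names and dimensions below are assumptions made for this sketch rather than the real ViT/mT5 components:

import torch
import torch.nn as nn

class MultimodalEncoderDecoder(nn.Module):
    # PaLI-style flow: patch embeddings + text embeddings -> joint encoder -> decoder.
    def __init__(self, d_model=512, vocab_size=32000, patch_dim=16 * 16 * 3):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)      # flattened RGB patches
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, text_ids, target_ids):
        # Image and text embeddings are concatenated and encoded jointly.
        img_emb = self.patch_proj(patches)                   # (B, num_patches, d_model)
        txt_emb = self.text_embed(text_ids)                  # (B, text_len, d_model)
        memory = self.encoder(torch.cat([img_emb, txt_emb], dim=1))
        # The autoregressive decoder cross-attends to the multimodal memory.
        T = target_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(self.text_embed(target_ids), memory, tgt_mask=causal)
        return self.lm_head(out)                             # (B, target_len, vocab_size)

model = MultimodalEncoderDecoder()
logits = model(torch.randn(2, 128, 16 * 16 * 3),
               torch.randint(0, 32000, (2, 12)),
               torch.randint(0, 32000, (2, 8)))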
We further extend PaLI's encoder-decoder architecture to accept various image patching patterns. The original PaLI architecture only accepts a fixed grid pattern of patches for processing the input images. However, the data we encounter in screen-related domains spans a wide variety of resolutions and aspect ratios. To have a single model work across all screen shapes, we need a patching strategy that works well with images of various shapes. To this end, we borrow a technique introduced in Pix2Struct [Lee et al., 2023], which allows us to have image patches with arbitrary grid shapes based on the input image shape and a pre-defined maximum number of patches, as shown in Figure 1. This enables us to accommodate input images of various formats and aspect ratios without the need to pad or stretch the image to a fixed shape, making our model versatile enough to handle both mobile (i.e. portrait) and desktop (i.e. landscape) image formats. In Section 5, we evaluate the impact of each of these modeling choices.
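To illustrate the general idea of aspect-ratio-preserving patching (a sketch only, not the exact Pix2Struct or ScreenAI implementation), one can pick the largest grid of fixed-size patches that roughly preserves the image's aspect ratio while staying within the patch budget:

import math

def adaptive_grid(height, width, max_patches, patch_size=16):
    # Choose a scale s such that (h*s/patch) * (w*s/patch) is about max_patches,
    # then floor each side so the grid never exceeds the budget.
    scale = math.sqrt(max_patches * patch_size ** 2 / (height * width))
    rows = max(1, math.floor(height * scale / patch_size))
    cols = max(1, math.floor(width * scale / patch_size))
    return rows, cols

# A portrait phone screenshot and a landscape desktop screenshot get different
# grids under the same budget:
print(adaptive_grid(2400, 1080, max_patches=2916))  # (80, 36), tall and narrow
print(adaptive_grid(1080, 1920, max_patches=2916))  # (40, 72), short and wide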
2.2 Model Configurations
We train models of three different sizes containing 670M, 2B and 5B parameters. For the 670M and 2B parameter models, we start from pre-trained unimodal checkpoints for the vision encoder and the encoder-decoder language models. For the 5B parameter model, we start from the multimodal pre-trained checkpoint of PaLI-3 [Chen et al., 2023a], where the ViT is trained together with the UL2-based [Tay et al., 2022] encoder-decoder language model. A breakdown of the parameter distribution among the vision and language models can be seen in Table 1.

Model | ViT | Encoder-Decoder | #params
670M | B16 (92M) | mT5 base (583M) | 675M
2B | H14 (653M) | mT5 Large (1.23B) | 1.88B
5B | G14 (1.69B) | UL2-3B (2.93B) | 4.62B

Table 1: Model variants and details of their parameter counts, split among the vision and language models. The image encoders are based on ViT [Dosovitskiy et al., 2020] and the text encoders are based on mT5 [Xue et al., 2020] and UL2 [Tay et al., 2022].

Our patching strategy allows variable aspect ratios and input resolutions, as long as they fit within the allocated sequence length budget (2024 embeddings for the 670M model, 2916 embeddings for the 2B model, and 3364 embeddings for the 5B model). For square images, the corresponding maximum input resolution is 720 × 720 for the 670M model, 756 × 756 for the 2B model, and 812 × 812 for the 5B model.
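For intuition, the square-image limit follows directly from the patch budget and the ViT patch size: assuming the 14-pixel patches implied by the H14 and G14 variant names, a 54 × 54 grid of patches fills the 2B model's 2916-embedding budget and covers 756 × 756 pixels. A quick check:

import math

def max_square_resolution(budget, patch_size):
    # Largest square patch grid whose patch count fits in the sequence budget.
    side_patches = math.isqrt(budget)
    return side_patches * patch_size

print(max_square_resolution(2916, 14))  # 54 * 14 = 756 (2B model)
print(max_square_resolution(3364, 14))  # 58 * 14 = 812 (5B model)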
2.3 Stages of Training
In this section, we cover the different stages of training.

Pre-Training. Starting from the checkpoints mentioned in Section 2.2, we do a first stage of training on large datasets generated from self-supervision and other models, using minimal human labeling (see Section 4.1 for a detailed description of the pre-training mixture). Contrary to the later fine-tuning stage, we train both the vision encoder and the language model. The motivation behind training the vision encoder is to incorporate the new patching strategy and to allow the model to adapt from natural images to UI-related images. We evaluate the impact of training the vision encoder and of including LLM-generated data on a variety of tasks in our ablation experiments in Section 5.

After some initial steps of pre-training, we perform additional steps with the ViT encoder frozen to further train the model while reducing resource consumption.

Fine-Tuning. During fine-tuning, the model is trained on mixtures of tasks, most of which are labeled by human annotators. These tasks are described in detail in Section 4.2. For QA-related tasks, we start by fine-tuning the model on a combination of QA-related tasks; then, additional training is performed on each individual task separately. For all other tasks, we fine-tune the model on each one individually.

3 Automatic Data Generation
The pretraining phase of our model's development is critically dependent on access to a vast and diverse dataset. Given the impracticality of manually annotating such an extensive dataset, our strategy focuses on automatic data generation. This approach leverages specialized smaller models, each adept at generating and labeling data both efficiently and with a high degree of accuracy.

In this section, we provide a detailed account of our data generation process, particularly highlighting how we gather and automatically annotate a diverse range of screenshots for pretraining our model. This automated approach is not only efficient and scalable compared to manual annotation, but also ensures a level of data diversity and complexity.

3.1 Screen Annotation
Our initial step is to equip the model with a comprehensive understanding of textual elements, various screen components, and their overall structure and hierarchy. This foundational understanding is vital for the model's ability to interpret and interact accurately with a wide range of user interfaces.

An extensive collection of screenshots has been amassed from various devices, including desktops, mobile phones, and tablets, by crawling applications and web pages [Raffel et al., 2020]. These screenshots are then annotated with detailed labels that describe the UI elements, their spatial relationships, and additional descriptive information.
Figure 2: Task generation pipeline: 1) the screens are first annotated using various models; 2) we then use an LLM to generate screen-related tasks at scale; 3) (optionally) we validate the data using another LLM or human raters.
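As a rough illustration of step 2 of this pipeline (a hypothetical sketch: generate_text stands in for whatever PaLM 2-S serving interface is used, and the prompt wording below is invented for illustration, not the paper's actual prompt):

def build_qa_prompt(screen_schema: str) -> str:
    # The screen schema produced in step 1 is pasted into a task-specific prompt.
    return (
        "You are given the annotation of a mobile screen.\n"
        "Generate five question-answer pairs a user might ask about this screen.\n\n"
        f"Screen annotation:\n{screen_schema}\n\nQuestions and answers:"
    )

def generate_qa_pairs(screen_schema: str, generate_text) -> str:
    # generate_text is a placeholder for an LLM call (e.g. a PaLM 2-S endpoint).
    return generate_text(build_qa_prompt(screen_schema))

# Example with a trivial stand-in "LLM":
fake_llm = lambda prompt: "Q: What city is shown? A: Sacramento, CA"
print(generate_qa_pairs("TOOLBAR 0 998 31 113 (TEXT Sacramento, CA 179 549 57 90)", fake_llm))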
(a) Screen annotation, with masking. Text input: "Describe this screenshot." Target: "IMAGE pleasure or love follows truthfulness then the merciful appears before him 0 993 0 261 (TEXT pleasure of love, follows truthfulness, then the Merciful appears before him 3 991 0 248), IMAGE a ma..."
(b) Question answering. Text input: "What is the name of the tailor?" Target: "Andrew Ramroop"
(c) Navigation. Text input: "Select the first item in the list." Target: "click 15 983 199 359"
(d) Summarization. Text input: "Summarize this screenshot." Target: "The screenshot shows a news article about UConn men's basketball recruiting. The article is about Dan Hurley's first recruit of the 2021 class, Rahsool Diggins, a 6'1″ point guard from Philadelphia."

Figure 4: Samples of tasks used in our pretraining mixture: (a) screen annotation, with masking; (b) question answering; (c) navigation; (d) summarization. The last three have been generated using our screen annotation model, coupled with PaLM 2-S.
make the following changes to task formulations: (1) we cast RefExp [Wichers et al., 2018] and Task Automation in MoTIF [Burns et al., 2022] as object detection tasks, without using candidate bounding boxes, and report accuracy at IoU=0.1 (intersection over union at threshold 0.1), considering only one predicted box; (2) for MoTIF, we report the number for the app-unseen split of the test set in Table 4, and the other split results in Table 5 of Appendix E.
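For reference, accuracy at IoU=0.1 simply counts a predicted box as correct when its intersection-over-union with the ground-truth box is at least 0.1. A minimal sketch, assuming boxes are given as (x_min, y_min, x_max, y_max), a convention chosen here for illustration:

def iou(a, b):
    # Boxes are (x_min, y_min, x_max, y_max).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def accuracy_at_iou(predictions, ground_truths, threshold=0.1):
    # One predicted box per example; correct if IoU >= threshold.
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

print(accuracy_at_iou([(0, 0, 10, 10)], [(8, 8, 20, 20)]))  # IoU ~ 0.017 -> 0.0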
We supplement the tasks mentioned above with three new benchmarks that we release:

• Screen Annotation (SA) (https://github.com/google-research-datasets/screen_annotation): To evaluate our model's layout annotation and spatial understanding capabilities, we create a dedicated benchmark consisting of 4.2K screenshots from the Rico dataset [Deka et al., 2017]. Each UI element has been annotated by human raters, and the annotations comprise a bounding box and a UI class from the list described in Section 3.1. We evaluate the model's predictions using object detection metrics, including F1 score, precision, and recall computed at IoU=0.1.

• ScreenQA Short (SQA Short) (https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#screenqa-short): ScreenQA [Hsiao et al., 2022], a benchmark for screen understanding, contains UI elements and full-sentence answers as ground truth. To align the output format with other question answering tasks, we generate a new ground truth, a list of alternative short answers, for each of the questions. We use the maximum F1 score across all the candidate answers as the metric (a sketch of this metric follows the list). See Figure 5 and Appendix F for more details.

• Complex ScreenQA (Cplx SQA) (https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#complexqa): To complement SQA Short, we introduce Complex ScreenQA, which includes more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios. See Figures 6 and 7 for examples and Appendix G for more details.
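The SQA Short metric mentioned above can be computed with standard SQuAD-style token F1, keeping the best match over the candidate short answers; a minimal sketch (omitting the usual SQuAD normalization of articles and punctuation):

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def max_f1(prediction: str, candidate_answers):
    # Score against every acceptable short answer and keep the best match.
    return max(token_f1(prediction, c) for c in candidate_answers)

print(max_f1("1 like and 1 comment", ["1 and 1", "1 like and 1 comment"]))  # 1.0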
We also provide a few additional details on how we handle Multipage DocVQA and ChartQA.

Multipage DocVQA. The standard fine-tuning task for Multipage DocVQA [Tito et al., 2023] can be transformed into a single-page DocVQA task by pairing the same question with each page of the document and choosing the answer with the highest score among all pages. In this formulation, we modify the training set by splitting a question, answer, and multipage document into a positive pair (with the actual answer for the page containing the answer) and multiple negative pairs (with "no answer" for pages which do not contain the answer). The negative pairs are subsampled to avoid overfitting on not predicting an answer, and the original DocVQA task [Mathew et al., 2021] is added to the fine-tuning mixture.
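A sketch of this training-set transformation (illustrative only; the record layout and the subsample_rate value are assumptions, not the paper's exact setup):

import random

def to_single_page_pairs(question, answer, answer_page, num_pages, subsample_rate=0.2):
    # One positive pair for the answer page, subsampled negatives for the rest.
    pairs = [{"question": question, "page": answer_page, "target": answer}]
    for page in range(num_pages):
        if page == answer_page:
            continue
        # Negative pages teach the model to abstain; subsample them so the model
        # does not overfit on predicting "no answer".
        if random.random() < subsample_rate:
            pairs.append({"question": question, "page": page, "target": "no answer"})
    return pairs

random.seed(0)
print(to_single_page_pairs("What is the invoice total?", "$1,200", answer_page=3, num_pages=12))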
Task Name/Benchmark | Metric

Screen Analysis
Screen Annotation [Ours, Sec. 4.2] | F1@IoU=0.1
Widget Captioning [Li et al., 2020] | CIDEr

Screen Question-Answering
ScreenQA Short [Ours, Sec. 4.2] | SQuAD F1
Complex ScreenQA [Ours, Sec. 4.2] | SQuAD F1
WebSRC [Chen et al., 2021b] | SQuAD F1

Screen Navigation
RefExp [Bai et al., 2021] | Acc@IoU=0.1
MoTIF-Automation [Burns et al., 2022] | Acc@IoU=0.1

Screen Summarization
Screen2Words [Wang et al., 2021] | CIDEr

Infographics/Doc Visual QAs
ChartQA [Masry et al., 2022] | Relaxed Acc.
DocVQA [Mathew et al., 2021] | ANLS
Multipage DocVQA [Tito et al., 2023] | ANLS
InfographicVQA [Mathew et al., 2022] | ANLS
OCR-VQA-200K [Mishra et al., 2019] | Exact Match

Table 3: Evaluation tasks/benchmarks and their associated metrics.

Question: How many links and comments are there of the post "Why Michael Flynn kept his Job 17 days after the White House!"?
Full sentence answers:
• There is 1 like and 1 comment on the post "Why Michael Flynn kept his job 17 days after the White House!".
• There is 1 like and 1 comment on the "Why Michael Flynn kept his Job 17 days after the White House!" post.
• There is 1 like and 1 comment.
List of short answers:
• one and one
• 1 and 1
• one, one
• 1, 1
• 1 like, 1 comment
• 1 like and 1 comment

Figure 5: Examples of questions and answers from the ScreenQA dataset, together with their LLM-generated short answers.
SoTA: - - - - 67.6a 130.7b 159.8b 80.8h 90.9h 61.8d 80.3h 77.8b 85.0f
Without OCR, SoTA≤5B: - - - - 67.6a 130.7b 159.8b 77.3i 87.8c - 57.8b 76.7b 77.8g
Without OCR, ScreenAI: 86.2 86.3 94.6 42.4 87.4 120.8 156.4 76.6 87.5 72.9 61.4 75.0 87.2
With OCR, SoTA≤5B: - - - - - - - 70.4c 89.3c 61.8d 62.4b 77.8b 85.0f
With OCR, ScreenAI: - - 94.8 43.5 - 123.7 - 76.7 89.9 77.1 65.9 76.2 -
Table 4: Comparison of ScreenAI with various SoTA models: (a) MoTIF [Burns et al., 2022], (b) PaLI-3 [Chen et al., 2023b], (c)
SmoLA PaLI-X [Wu et al., 2023a], (d) Hi-VT5 [Tito et al., 2023], (e) TILT [Powalski et al., 2021], (f) DocPrompt [Wu et al., 2023b],
(g) DUBLIN [Aggarwal et al., 2023], (h) Gemini [Anil et al., 2023a], (i) ChartPaLI-5B [Carbune et al., 2024]. Bold font highlights SoTA
score, and underscore represents best-in-class score. See Table 3 for details about the tasks and their associated metrics.
Figure 8: Performance of different model sizes on fine-tuning tasks (Screen Annotation, RefExp, SQA Short, Complex SQA, MoTIF, Screen2Words, ChartQA, DocVQA, InfographicVQA, OCR-VQA). The metrics improve consistently as the model size increases.
Figure 9: Ablation study for Pix2Struct vs. fixed-grid patching; the numbers represent the aggregated scores across all fine-tuned tasks. For aspect ratio > 1.0, Pix2Struct patching significantly outperforms fixed-grid patching, whereas for aspect ratio < 1.0, fixed-grid patching outperforms Pix2Struct by a smaller margin.
tasks. We also illustrate the impact of data generation using LLMs and justify our model design choices with ablation studies. We apply these techniques to train a model that performs competitively and achieves SoTA on a number of public benchmarks. While our model is best-in-class, we note that, on some tasks, further research is needed to bridge the gap with models like GPT-4 and Gemini, which are orders of magnitude larger. To encourage further research, we release a dataset with this unified representation, as well as two other datasets to enable more comprehensive benchmarking of models on screen-related tasks.

Acknowledgements
We would like to thank team alumni Yo Hsiao and Zixian Ma for their contributions to the project; Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li, Yang Li, Radu Soricut and Tania Bedrax-Weiss for their insightful feedback and fruitful discussions; Rahul Aralikatte, Hao Cheng and Daniel Kim for their wholehearted and tireless support in data preparation; and Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their vision and support in leadership.

Contribution Statement
First Authors with Equal Contributions: Gilles Baechler, Srinivas Sunkara, Maria Wang, Jindong Chen.
Project Leads: Jindong Chen, Abhanshu Sharma.

References
[Aggarwal et al., 2023] Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan, Qiang Liu, Monojit Choudhury, Subhojit Som, Vishrav Chaudhary, and Saurabh Tiwary. DUBLIN: Document understanding by language-image network. arXiv preprint arXiv:2305.14218, 2023.
[Aghajanyan et al., 2021] Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, and Luke Zettlemoyer. HTLM: Hyper-text pre-training and prompting of language models, 2021.
[Alayrac et al., 2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[Anil et al., 2023a] Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[Anil et al., 2023b] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
[Bai et al., 2021] Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, and Blaise Aguera y Arcas. UIBert: Learning generic multimodal representations for UI understanding, 2021.
[Burns et al., 2022] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer. A dataset for interactive vision language navigation with unknown command feasibility. In European Conference on Computer Vision (ECCV), 2022.
[Carbune et al., 2024] Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, and Abhanshu Sharma. Chart-based reasoning: Transferring capabilities from LLMs to VLMs, 2024.
[Carion et al., 2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[Chen et al., 2021a] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852, 2021.
[Chen et al., 2021b] Xingyu Chen, Zihan Zhao, Lu Chen, Danyang Zhang, Jiabao Ji, Ao Luo, Yuxuan Xiong, and Kai Yu. WebSRC: A dataset for web-based structural reading comprehension, 2021.
[Chen et al., 2022] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
[Chen et al., 2023a] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023.
[Chen et al., 2023b] Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. PaLI-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023.
[Deka et al., 2017] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845–854, 2017.
[Deng et al., 2023] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023.
[Dosovitskiy et al., 2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[Gehrmann et al., 2022] Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, and Clara Rivera. TaTA: A multilingual table-to-text dataset for African languages, 2022.
[Gur et al., 2022] Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding HTML with large language models. arXiv preprint arXiv:2210.03945, 2022.
[He et al., 2021] Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen, and Blaise Agüera y Arcas. ActionBert: Leveraging user actions for semantic understanding of user interfaces, 2021.
[Hsiao et al., 2022] Yu-Chung Hsiao, Fedir Zubach, Maria Wang, et al. ScreenQA: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199, 2022.
[Huang et al., 2022] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022.
[Kafle et al., 2018] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering, 2018.
[Kil et al., 2023] Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, and Radu Soricut. PreSTU: Pre-training for scene-text understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15270–15280, 2023.
[Kim et al., 2021] Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Donut: Document understanding transformer without OCR. arXiv preprint arXiv:2111.15664, 2021.
[Kuo et al., 2023] Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, et al. MaMMUT: A simple architecture for joint learning for multimodal tasks. arXiv preprint arXiv:2303.16839, 2023.
[Lee et al., 2023] Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
[Li and Li, 2022] Gang Li and Yang Li. Spotlight: Mobile UI understanding using vision-language models with a focus. arXiv preprint arXiv:2209.14927, 2022.
[Li et al., 2020] Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget captioning: Generating natural language description for mobile user interface elements, 2020.
[Li et al., 2021] Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, and Alexey Gritsenko. VUT: Versatile UI transformer for multi-modal multi-task user interface modeling. arXiv preprint arXiv:2112.05692, 2021.
[Li et al., 2022a] Gang Li, Gilles Baechler, Manuel Tragut, and Yang Li. Learning to denoise raw mobile UI layouts for improving datasets at scale. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–13, 2022.
[Li et al., 2022b] Tao Li, Gang Li, Jingjie Zheng, Purple Wang, and Yang Li. MUG: Interactive multimodal grounding on user interfaces, 2022.
[Liu et al., 2022] Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. MatCha: Enhancing visual language pretraining with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662, 2022.
[Liu et al., 2023] Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation, 2023.
[Masry et al., 2022] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
[Masry et al., 2023] Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning, 2023.
[Mathew et al., 2021] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A dataset for VQA on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.
[Mathew et al., 2022] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.
[Methani et al., 2020] Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. PlotQA: Reasoning over scientific plots, 2020.
[Mishra et al., 2019] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019.
[Nakano et al., 2021] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
[Powalski et al., 2021] Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, and Gabriela Pałka. Going full-TILT boogie on document understanding with text-image-layout transformer, 2021.
[Raffel et al., 2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[Rajpurkar et al., 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text, 2016.
[Rawles et al., 2023] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for Android device control. arXiv preprint arXiv:2307.10088, 2023.
[Sharma et al., 2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
[Sunkara et al., 2022] Srinivas Sunkara, Maria Wang, Lijuan Liu, Gilles Baechler, Yu-Chung Hsiao, Abhanshu Sharma, James Stout, et al. Towards better semantic understanding of mobile interfaces. arXiv preprint arXiv:2210.02663, 2022.
[Tang et al., 2023] Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.
[Tay et al., 2022] Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, 2022.
[Tito et al., 2023] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Hierarchical multimodal transformers for multipage DocVQA. Pattern Recognition, 144:109834, 2023.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[Vedantam et al., 2015] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation, 2015.
[Wang et al., 2021] Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. Screen2Words: Automatic mobile UI summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510, 2021.
[Wang et al., 2022] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
[Wang et al., 2023] Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. DocLLM: A layout-aware generative language model for multimodal document understanding. arXiv preprint arXiv:2401.00908, 2023.
[Wichers et al., 2018] Nevan Wichers, Dilek Hakkani-Tür, and Jindong Chen. Resolving referring expressions in images with labeled elements. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 800–806. IEEE, 2018.
[Wu et al., 2021] Jason Wu, Xiaoyi Zhang, Jeff Nichols, and Jeffrey P Bigham. Screen parsing: Towards reverse engineering of UI models from screenshots. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 470–483, 2021.
[Wu et al., 2023a] Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, and Radu Soricut. Omni-SMoLA: Boosting generalist multimodal models with soft mixture of low-rank experts, 2023.
[Wu et al., 2023b] Sijin Wu, Dan Zhang, Teng Hu, and Shikun Feng. DocPrompt: Large-scale continue pretrain for zero-shot and few-shot document question answering, 2023.
[Xue et al., 2020] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934, 2020.
[Yang et al., 2022] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. UniTAB: Unifying text and box outputs for grounded vision-language modeling. In European Conference on Computer Vision, pages 521–539. Springer, 2022.
[Zang et al., 2021] Xiaoxue Zang, Ying Xu, and Jindong Chen. Multimodal icon annotation for mobile applications. In Proceedings of the 23rd International Conference on Mobile Human-Computer Interaction, pages 1–11, 2021.
[Zhang et al., 2021] Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, et al. Screen recognition: Creating accessibility metadata for mobile applications from pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.
Metrics for benchmarks where output is plain text. For all other tasks, we use the following metrics:
1. CIDEr - Consensus-based Image Description Evaluation [Vedantam et al., 2015].
2. SQuAD F1 - F1 score (harmonic mean of the precision and recall) after applying SQuAD (Stanford Question Answering Dataset) [Rajpurkar et al., 2016] text pre-processing.

Figure 11: Examples of Screen Navigation data generated using an LLM. The target bounding box is highlighted in red.

TOOLBAR 0 998 31 113 (
  PICTOGRAM arrow backward 0 135 32 112
  TEXT Sacramento, CA 179 549 57 90)
TEXT H16 45 113 115 136
LIST_ITEM 0 994 164 611 (
  IMAGE a window with a black curtain 0 91 177 466
  IMAGE one building with a few palm trees surrounding it .
  PICTOGRAM time 34 87 645 675
  TEXT Closed Opens at 12:00 128 518 647 676
  BUTTON More info 745 959 636 685)
LIST_ITEM 4 988 697 763 (
  PICTOGRAM 743 714 87 35
  TEXT Schedule for later 129 420 715 744
  BUTTON Change 778 957 704 754)
TEXT Unfortunately, this restaurant does not 94 733 811 839
TEXT deliver to your location 90 460 842 868
BUTTON OK 782 931 807 870
PICTOGRAM sad face 475 522 840 867
TEXT Search AkⱭkiKU LilliASSOT 98 603 904 921
NAVIGATION_BAR 0 997 933 999 (
  PICTOGRAM arrow backward 187 254 948 984
  PICTOGRAM a gray circle with a white background 471 532 951 983
  PICTOGRAM nav bar rect 752 809 951 982)

Figure 12: Examples of questions and answers from the ScreenQA dataset, together with their LLM-generated short answers.
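To make the screen schema above concrete: each element appears to follow the pattern UI_CLASS, optional text or description, then four integers (apparently coordinates on a 0-999 normalized grid; the exact axis convention is an assumption made here). A rough parser sketch that flattens the nesting parentheses:

import re

# Assumed pattern: an upper-case UI class, optional free text, then four integers.
ELEMENT = re.compile(r"([A-Z_]+)\s+(.*?)\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

def parse_schema(schema: str):
    elements = []
    for line in schema.replace("(", "\n").replace(")", "\n").splitlines():
        match = ELEMENT.match(line.strip())
        if match:
            coords = tuple(int(match.group(i)) for i in range(3, 7))
            elements.append({"class": match.group(1), "text": match.group(2), "box": coords})
    return elements

print(parse_schema("TOOLBAR 0 998 31 113 (TEXT Sacramento, CA 179 549 57 90)"))
# [{'class': 'TOOLBAR', 'text': '', 'box': (0, 998, 31, 113)},
#  {'class': 'TEXT', 'text': 'Sacramento, CA', 'box': (179, 549, 57, 90)}]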
Here is another example:
Question: 'What is the status of "24 hr clock"?'
Answer elements: ['on']
Full answer: 'The status is "on".'
Rephrases: ['on', 'enabled']
[...]
Now is your turn.
Question: {THE QUESTION}
Answer elements: {THE UI ELEMENT DESCRIPTION}
Full answer: {THE FULL-SENTENCE ANSWER}
Rephrases:

F.2 For answers contained in multiple UI elements
For each entry in the ScreenQA dataset where there is more than one UI element in the ground truth, we use the following prompt with the PaLM 2-S model [Anil et al., 2023b] to generate a list of short answers from the question, the list of UI elements, and the full-sentence answer:

List various ways to rephrase the answer. The answer should be as short as possible, without extra words from the question. Use all provided elements in each answer. Provide the output in square brackets.

Here is an example:
Question: 'What's the temperature?'
Answer elements: ['59', '° F']
Full answer: 'The temperature is 59 degrees Fahrenheit.'
Rephrases: ['59° F', '59 Fahrenheits', '59 degrees Fahrenheit']
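A small sketch of how such a template can be instantiated for one ScreenQA entry (the in-context example is omitted for brevity; call_palm is a hypothetical stand-in for the PaLM 2-S API, and the field names are illustrative):

def short_answer_prompt(question, answer_elements, full_answer):
    # Fill the few-shot template above with one ScreenQA entry.
    return (
        "List various ways to rephrase the answer. The answer should be as short "
        "as possible, without extra words from the question. Use all provided "
        "elements in each answer. Provide the output in square brackets.\n\n"
        f"Question: {question!r}\n"
        f"Answer elements: {answer_elements}\n"
        f"Full answer: {full_answer!r}\n"
        "Rephrases:"
    )

def generate_short_answers(entry, call_palm):
    # call_palm is a placeholder for the LLM endpoint returning the bracketed list.
    return call_palm(short_answer_prompt(entry["question"],
                                         entry["answer_elements"],
                                         entry["full_answer"]))

print(short_answer_prompt("What's the temperature?", ["59", "° F"],
                          "The temperature is 59 degrees Fahrenheit."))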
Annotation output from the best ScreenAI VLM. For each dataset, the prompts are chosen to target certain types of questions. With this approach, we generate large-scale datasets for desktop, mobile, mobile with different aspect ratios, and infographics screens. These datasets are used both for pre-training and evaluation. We add an additional step of verification by human raters for the evaluation data. Figure 13 and Figure 14 show a few examples of LLM-generated QA data that was verified by humans.

We distinguish three different subsets, each focusing on solving the various challenges we identified with this task:

• Desktop QA and Long Webpage QA: Datasets on desktop screens and long (viewport height) webpages, respectively. The aspect ratio and size of the input images are very different compared to other QA datasets.
• Complex QA datasets: Datasets mainly focused on counting, arithmetic, and comparison operations requiring information from more than one part of the screen.
  – Complex QA: Mobile app screens.
  – Desktop Complex QA: Desktop screens.
  – Long Webpage Complex QA: Long webpages.
• Non Answerable QA: Dataset focused on measuring the ability of the model to know when a question cannot be answered from the given screen.

H New Benchmarks Repositories
We release three evaluation datasets for the tasks described in Section 4.2:

• Screen Annotation (SA): https://github.com/google-research-datasets/screen_annotation
• ScreenQA Short (SQA Short): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#screenqa-short
• Complex ScreenQA (Cplx SQA): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#complexqa
Figure 13: Examples of mobile Complex QA evaluation data.
Figure 14: Examples of desktop Complex QA evaluation data.