A Deep Learning Based Object Detection System For User Interface Code Generation
Mustafa Dağtekin
Computer Engineering Department
Istanbul University - Cerrahpasa
Istanbul, Turkey
0000-0002-0797-9392
Abstract—The Graphical User Interfaces (GUIs) of web applications include visuals and designs that allow users to interact with machines. Once the GUI design is done, its GUI code must be generated. However, the GUI code generation process is highly time consuming and highly dependent on software developers. The development of automatic GUI code generating systems has therefore attracted considerable attention recently. In this study, a GUI code generating system for web sites is designed using a Deep Learning (DL) approach. A dataset containing the coordinate, width, height and type of GUI objects is created from 7500 webpages. The created dataset is applied to the proposed system in order to detect objects in the GUI image and generate DSL mark-up code. Experiments were carried out to analyze the effectiveness of the proposed system, and performance evaluations were made.

Index Terms—GUI, code generation, HTML, deep learning

I. INTRODUCTION

In order to interact with a computer program or an application, a user may be required to input commands through an "interface" provided by the application. These interfaces may be a plain text field for entering queries and commands, or they may be graphical front-ends with enhanced visual styles. Such graphical fronts of applications are generally called "Graphical User Interfaces", or GUIs. The GUI of an application may have fields for text input, buttons that execute actions, or other utilities that further enhance the user experience. The creation of the GUI is a very important part of the software development process, because the GUI is the part that an end-user sees and interacts with.

The GUI design process may begin with a designer sketching the "picture" of a user interface with pen and paper or with specialized computer software for creating digital images. These tools may offer different levels of abstraction, ranging from a plain digital image to a layered representation that may include some rudimentary code. Such creations are usually called "mock-ups". The mock-up designers send the work down the development chain and the software developers write the code for it; the work may circle back to the mock-up designers for fine tuning multiple times. GUI code generation is often a time-consuming task that adds up to a significant amount of the developers' time, and the GUI development process also depends significantly on the designers [1]. To overcome these issues, automatic User Interface (UI) code generator systems have been developed recently. A hand-drawn, scanned or computer-generated mock-up of a GUI is given to such a system as input, and the corresponding code is produced as output. Most studies on code generation from design images utilize machine learning techniques.

Beltramelli [2] developed a model that generates code for a GUI image automatically. In that study the model was trained with stochastic gradient descent. He obtained variable-length strings of tokens and used them in Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) deep learning models to generate the code, achieving over 77% system accuracy. In the study of Halbe and Joshi [1], each
HTML control was segmented and identified. Features of the controls were extracted via the Discrete Cosine Transform (DCT) and trained using a Learning Vector Quantization Neural Network (LVQ NN) with a "log sigmoid target vector". Jain et al. [3] implemented a system that takes sketch images and converts them into the corresponding GUI code for different platforms. Their study identifies the components and provides them as JavaScript Object Notation (JSON) structures. They trained a neural network whose architecture learned the shapes of the GUI elements in order to identify them in input images; overlapping components were then fixed before the automatic code generation. Pang et al. [4] proposed a code generation model that combined visual attention based GUI features with DSL (Domain Specific Language) attention based semantic features obtained using long short-term memory (LSTM) networks. Their system produced DSL code from GUIs while accounting for semantic information; moreover, they used ON-LSTM to generate DSL code with more accurate grammar rules. Pawade et al. [5] developed a system that automatically produced HTML code for web pages from GUI image templates. They assigned a color to each GUI element component and evaluated the location coordinates and area of those components. Optical Character Recognition (OCR) was also used to extract the labels on the images, and both neural networks and KNN (K-Nearest Neighbors) classifiers were utilized for classification. In the study of Xu et al. [6], web GUI images and HTML-CSS (Cascading Style Sheets) codes were first collected simultaneously; components were then detected via region-based CNN deep learning methods, and CNN and LSTM were combined to generate the GUI code. For the same task, Han et al. [7] utilized attention mechanisms in addition to object detection to obtain the web page's CSS style content, such as the color or the character style of the components; they also used CNN and LSTM to produce HTML code from a web page image input. Abolafia et al. [8] proposed a methodology for program synthesis using Priority Queue Training, an optimization algorithm. They trained an RNN (Recurrent Neural Network) on a dataset, synthesized new programs and added them to a priority queue; by sampling new programs from the RNN, the queue was updated and the iteration continued. In our previous work, Asiroglu et al. [9] proposed an automatic HTML code generation system for hand-drawn mock-up images. The mock-ups were first processed with computer vision techniques such as dilation and erosion morphological operations; a CNN deep learning model was then used to generate the relevant HTML code. For the same task, Moran et al. [10] developed a system for Android and iPhone called ReDraw, comprising detection, classification and assembly. First, the GUI components were detected from the mock-ups using image processing methods. These components were then classified into domain-specific types using software repository mining, automated dynamic analysis and deep convolutional neural networks. Finally, the GUI hierarchy, from which a prototype application can be assembled, was generated automatically using the K-nearest-neighbors algorithm.

In this study, we propose a deep learning based system for automatic GUI code generation for web pages. Data on the coordinates, widths, heights and types of the components in approximately 7,300 web page GUI images are extracted to create the dataset. These data compose the feature matrix used as the output of a CNN deep learning architecture. The proposed system generates an intermediate DSL (Domain Specific Language) code for the input mock-up images. The DSL code is a generic, XML-like code describing the types, locations, content and hierarchy of the GUI components extracted from the image, as illustrated in the sketch below.
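The paper does not give the DSL grammar explicitly; the following minimal Python sketch only illustrates the idea of serializing detected components into XML-like mark-up. The tag names and attribute set are hypothetical (the form, textbox and button labels echo the component classes mentioned later in the results), not the actual DSL of the system.

def to_dsl(components, indent=0):
    """Serialize a component tree into XML-like DSL mark-up (illustrative)."""
    pad = "  " * indent
    lines = []
    for c in components:
        attrs = f'x="{c["x"]}" y="{c["y"]}" w="{c["w"]}" h="{c["h"]}"'
        children = c.get("children", [])
        if children:
            lines.append(f'{pad}<{c["type"]} {attrs}>')
            lines.append(to_dsl(children, indent + 1))
            lines.append(f'{pad}</{c["type"]}>')
        else:
            lines.append(f'{pad}<{c["type"]} {attrs}/>')
    return "\n".join(lines)

# Toy component tree: a form containing a textbox and a button.
page = [{"type": "form", "x": 10, "y": 20, "w": 300, "h": 200,
         "children": [{"type": "textbox", "x": 20, "y": 40, "w": 280, "h": 30},
                      {"type": "button", "x": 20, "y": 90, "w": 100, "h": 30}]}]
print(to_dsl(page))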
II. MATERIALS AND METHODS

A. Generated Dataset

The complete dataset consists of 3026 unique web page templates captured as screenshots from open-source web coding platforms. All images were annotated with bounding boxes and class labels by analyzing the location, component type and hierarchical information in the HTML code. HTML code can be represented by a DSL (Domain-Specific Language) consisting of a string of pseudo-HTML labels; the learning model thus encodes web pages as DSL, which can later be transformed into the corresponding HTML code. The data were further randomly split into 2723 training and 303 test images. The training images include approximately 30,000 UI elements belonging to 7 classes. The detailed class distribution is given in Fig. 1.

Fig. 1. Class distributions of the generated dataset
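As a rough illustration of the annotation layout and the random split described above, the sketch below assumes a hypothetical per-image record; the field names are ours, since the paper only states that each element is annotated with its coordinates, width, height and class label.

import random

# Hypothetical annotation record for one screenshot (field names assumed).
annotation = {
    "image": "template_0001.png",
    "elements": [
        {"class": "textbox", "x": 120, "y": 48, "w": 320, "h": 36},
        {"class": "button",  "x": 120, "y": 96, "w": 120, "h": 36},
    ],
}

def train_test_split(images, n_train=2723, seed=42):
    """Randomly split the 3026 template images into the 2723/303
    train/test partition described in the text."""
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]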
B. Proposed Model

There are several deep learning models for object detection tasks. In this study, we developed a model based on the Xception architecture [11], a deep convolutional neural network built on depthwise separable convolutions. Xception networks were proposed to overcome the high computational cost caused by the large number of parameters involved in convolving 3-channel images. By convolving these images with various filters, the aim is to reveal patterns between channels as well as patterns across the height and width dimensions.
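A minimal sketch of the parameter saving behind depthwise separable convolutions, using standard Keras layers; the filter counts are illustrative, not taken from the paper.

import tensorflow as tf
from tensorflow.keras import layers

# Standard 3x3 convolution, 256 channels in and out:
# 3*3*256*256 weights (+256 biases) per layer.
standard = layers.Conv2D(256, kernel_size=3, padding="same")

# Depthwise separable convolution (the Xception building block):
# a 3x3 depthwise filter per channel followed by a 1x1 pointwise
# convolution, i.e. 3*3*256 + 256*256 weights -- roughly 9x fewer.
separable = layers.SeparableConv2D(256, kernel_size=3, padding="same")

x = tf.random.normal((1, 32, 32, 256))
print(standard(x).shape, separable(x).shape)  # both (1, 32, 32, 256)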
Our deep learning model takes 256x256x3 images as inputs. Feature extraction is performed by applying the Xception architecture to those inputs. The output tensor of the Xception model, of size 16x16x728, is then given as input to the convolution layers. After each convolution layer, batch normalization is performed and the ReLU activation function is applied. The output of the first convolution layer (an 8x8x256 tensor) is fed into the triple convolution block as input.

In the model, there are 4 blocks, each formed by three convolution layers. Residual connections are made by adding the output of the first convolution layer in each block to that of the last convolution layer. The flow chart of our model is shown in Fig. 2.
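A hedged Keras sketch of one such residual block follows; the conv + batch norm + ReLU pattern and the first-to-last addition follow the description above, while kernel sizes and filter counts are assumptions (the paper gives only tensor shapes).

import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size=3):
    """Convolution followed by batch normalization and ReLU,
    the per-layer pattern described in the text."""
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def triple_conv_block(x, filters):
    """One of the four residual blocks: three conv layers, with the
    output of the first added to the output of the third."""
    first = conv_bn_relu(x, filters)
    mid = conv_bn_relu(first, filters)
    last = conv_bn_relu(mid, filters)
    return layers.Add()([first, last])

x = tf.random.normal((1, 8, 8, 256))
print(triple_conv_block(x, 256).shape)  # (1, 8, 8, 256)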
A detailed description of the model steps is as follows:

Step 1: Max Pooling, Self Attention and Transpose Convolution layers are used, in that order, after Block 1. The output of the Transpose Convolution layer is a 28x28x256 tensor, which is the input of Block 2.

Step 2: A Max Pooling layer followed by a Self Attention layer is used after Block 2. The output of the Transpose Convolution layer is a 14x14x1024 tensor, which is the input of Block 3.

Step 3: A Self Attention layer is used after Block 3. The output of the Transpose Convolution layer is a 14x14x512 tensor, which is the input of Block 4.

Step 4: A Self Attention layer is used after Block 4. The output of the Transpose Convolution layer is a 14x14x256 tensor.

Step 5: In this step three convolution layers are used in a row. The outputs of these layers are 11x11x128, 8x8x64 and 5x5x32 tensors, respectively, and are given to a Fully Connected layer.
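The text does not specify the self-attention variant or the strides involved; one plausible Keras reading of the Max Pooling -> Self Attention -> Transpose Convolution pattern of Step 1 is sketched below, with the attention mechanism, kernel sizes and strides all being assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def spatial_self_attention(x, num_heads=4):
    """Self attention over the spatial positions of a feature map;
    multi-head attention over flattened positions is one plausible
    reading of the 'Self Attention' layer in the text."""
    _, h, w, c = x.shape
    seq = layers.Reshape((h * w, c))(x)
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=c // num_heads)(seq, seq)
    return layers.Reshape((h, w, c))(attn)

def upsample_stage(x, filters):
    """Step 1 pattern: Max Pooling -> Self Attention -> Transpose
    Convolution. Pooling and stride choices are assumptions; the tensor
    sizes quoted in Steps 1-4 would require strides/paddings the paper
    does not state."""
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = spatial_self_attention(x)
    return layers.Conv2DTranspose(filters, kernel_size=3,
                                  strides=2, padding="same")(x)

x = tf.random.normal((1, 8, 8, 256))
print(upsample_stage(x, 256).shape)  # (1, 8, 8, 256)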
III. RESULTS

We applied the proposed deep learning model to our dataset, in which 5,000 and 2,325 images are available for the training and test phases, respectively. The precision and recall performance metrics are utilized to obtain the performance evaluation results. The precision metric is calculated by dividing the true positives by all positive predictions, while recall is calculated by dividing the true positives by the number of actual positives. Equation (1) is used for the precision metric $p$ and (2) for recall $r$ [12], where $TP$, $FP$ and $FN$ denote the numbers of true positives, false positives and false negatives, respectively.

\[ p = \frac{TP}{TP + FP} \tag{1} \]

\[ r = \frac{TP}{TP + FN} \tag{2} \]

In this study we used the average precision (AP) metric, given in (3), instead of the single-value precision and recall metrics, in order to summarize the precision-recall curve. AP is the weighted mean of the precisions achieved at each threshold, where $p_i$ and $r_i$ are the precision and recall values at the $i$-th threshold and $\Delta r_i$ is the difference in recall from threshold $i-1$ to $i$. The Precision-Recall and ROC curves can be seen in Fig. 3 and Fig. 4, respectively.

\[ AP = \sum_{i=0}^{n} p_i \, \Delta r_i \tag{3} \]
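A small Python sketch of the metrics in (1)-(3); the thresholded precision/recall pairs used in the example are toy values, not results from the paper.

import numpy as np

def precision_recall(tp, fp, fn):
    """Precision (1) and recall (2) from true positive, false positive
    and false negative counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """AP as in (3): the mean of precisions weighted by the change in
    recall between consecutive thresholds (recall before the first
    threshold taken as 0)."""
    p = np.asarray(precisions, dtype=float)
    r = np.asarray(recalls, dtype=float)
    delta_r = np.diff(np.concatenate(([0.0], r)))  # r_i - r_{i-1}
    return float(np.sum(p * delta_r))

print(precision_recall(tp=80, fp=20, fn=10))       # (0.8, 0.888...)
print(average_precision([1.0, 0.8, 0.6], [0.2, 0.5, 0.9]))  # 0.68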
Table I lists the AP, the area under the precision-recall curve (AUC-PRC) and the area under the ROC curve (AUC-ROC) for each component. According to the results, the form and textbox components achieve the highest scores among all components.

TABLE I
AP, AUC-PRC AND AUC-ROC VALUES FOR THE COMPONENTS
Fig. 2. Proposed model
[6] Y. Xu, L. Bo, X. Sun, B. Li, J. Jiang, and W. Zhou, "image2emmet: Automatic code generation from web user interface image," Journal of Software: Evolution and Process, vol. 33, no. 8, Aug. 2021. [Online]. Available: https://onlinelibrary.wiley.com/doi/10.1002/smr.2369
[7] Y. Han, J. He, and Q. Dong, "CSSSketch2Code: An automatic method to generate web pages with CSS style," ACM International Conference Proceeding Series, pp. 29-35, 2018.
[8] D. A. Abolafia, M. Norouzi, J. Shen, R. Zhao, and Q. V. Le, "Neural Program Synthesis with Priority Queue Training," arXiv preprint, Jan. 2018. [Online]. Available: http://arxiv.org/abs/1801.03526
[9] B. Asiroglu, B. R. Mete, E. Yildiz, Y. Nalcakan, A. Sezen, M. Dagtekin, and T. Ensari, "Automatic HTML Code Generation from Mock-Up Images Using Machine Learning Techniques," in 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT). IEEE, Apr. 2019, pp. 1-4. [Online]. Available: https://ieeexplore.ieee.org/document/8741736/
[10] K. Moran, C. Bernal-Cardenas, M. Curcio, R. Bonett, and D. Poshyvanyk, "Machine Learning-Based Prototyping of Graphical User Interfaces for Mobile Apps," IEEE Transactions on Software Engineering, vol. 46, no. 2, pp. 196-221, Feb. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/8374985/
[11] F. Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jul. 2017, pp. 1800-1807. [Online]. Available: http://ieeexplore.ieee.org/document/8099678/
[12] M. Zhu, "Recall, precision and average precision," 2004.