
A Deep Learning Based Object Detection System for User Interface Code Generation

2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) | 978-1-6654-6835-0/22/$31.00 ©2022 IEEE | DOI: 10.1109/HORA55278.2022.9799941

Batuhan Aşıroğlu, Computer Engineering Department, Istanbul University - Cerrahpasa, Istanbul, Turkey, ORCID 0000-0003-0767-6348
Sibel Senan, Computer Engineering Department, Istanbul University - Cerrahpasa, Istanbul, Turkey, ORCID 0000-0001-6773-0428
Pelin Görgel, Computer Engineering Department, Istanbul University - Cerrahpasa, Istanbul, Turkey, ORCID 0000-0001-8884-1290
M. Erdem Isenkul, Computer Engineering Department, Istanbul University - Cerrahpasa, Istanbul, Turkey, ORCID 0000-0003-0856-2174
Tolga Ensari, Department of Computer and Information Sciences, Arkansas Tech University, Arkansas, USA, ORCID 0000-0003-0896-3058
Alper Sezen, Research and Development Center, Turkcell Technology, Istanbul, Turkey, ORCID 0000-0002-0140-6462
Mustafa Dağtekin, Computer Engineering Department, Istanbul University - Cerrahpasa, Istanbul, Turkey, ORCID 0000-0002-0797-9392

Abstract—The Graphical User Interfaces (GUIs) of web applications include visuals and designs that allow users to interact with machines. Once the GUI design is done, it is necessary to generate its GUI code. However, the GUI code generation process is highly time consuming as well as highly dependent on software developers. Therefore, the development of automatic GUI code generating systems has recently gained great importance. In this study, a GUI code generating system for web sites is designed using the Deep Learning (DL) approach. The dataset, including the "coordinate, width, height and type" of GUI objects, is created using 7500 webpages. The created dataset is applied to the proposed system in order to detect objects in the GUI image and generate DSL mark-up code. Experiments were carried out to analyze the effectiveness of the proposed system and performance evaluations were made.

Index Terms—GUI, code generation, HTML, deep learning

I. INTRODUCTION

In order to interact with a computer program or an application, a user may be required to input some commands using an "interface" provided by the application. These interfaces may be a plain text field for entering queries and commands, or they may be graphical front-ends with enhanced visual styles and looks. These graphical fronts of applications are generally called "Graphical User Interfaces", or GUIs. The GUI of an application may have fields to input text, buttons that execute some action, or other utilities that further enhance the experience. The creation of the GUI is a very important part of the software development process because the GUI is the part that an end-user sees and interacts with. The GUI design process may begin with a designer sketching a "picture" of a user interface using pen and paper or specialized computer software for creating digital images. These tools may offer different levels of abstraction, ranging from a plain digital image to some sort of layered representation that may include rudimentary code. These creations are usually called "mock-ups". The mock-up designers send the work down the development chain and the software developers write the code for it; the work may circle back to the mock-up designers for fine tuning multiple times. GUI code generation is often a time-consuming task and adds up to a significant amount of the developers' time. The GUI development process also depends significantly on the designers [1]. To overcome these issues, automatic User Interface (UI) code generator systems have been developed recently: a hand-drawn, scanned or computer-generated mock-up of a GUI image is given to the system as input, and appropriate code is generated at the output. Most of the studies on code generation from design images utilize machine learning techniques.

Beltramelli [2] developed a model that generates code for a GUI image automatically. In this study the model was trained with stochastic gradient descent. He obtained variable-length strings of tokens and used them in Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) deep learning models to generate the code, achieving over 77% system accuracy. In the study of Halbe and Joshi [1], each


Authorized licensed use limited to: Istanbul Universitesi-Cerrahpasa. Downloaded on July 07,2022 at 08:13:10 UTC from IEEE Xplore. Restrictions apply.
HTML control was segmented and identified. Features of the controls were extracted via the Discrete Cosine Transform (DCT) and trained using a Learning Vector Quantization Neural Network (LVQ NN) with a "log sigmoid target vector". Jain et al. [3] implemented a system that converted sketch images into corresponding GUI code for different platforms. Their study identifies the components and provides them as JavaScript Object Notation (JSON) structures. They trained a neural network whose architecture learned the shapes of the GUI elements in order to identify them in input images; overlapped components were then fixed before automatic code generation. Pang et al. [4] proposed a code generation model that combined visual attention based GUI features and DSL (Domain Specific Language) attention based semantic features obtained using long short-term memory networks (LSTM). Their system produced DSL code from GUIs with respect to semantic information; moreover, they used ON-LSTM to generate DSL code with more accurate grammar rules. Pawade et al. [5] developed a system that automatically produced HTML code from GUI image templates. They assigned a color to each GUI element component and evaluated the location coordinates and areas of those components; Optical Character Recognition (OCR) was used to extract the labels on the images. For classification they utilized both neural networks and KNN (K-Nearest Neighbors) classifiers. In the study of Xu et al. [6], web GUI images and HTML-CSS (Cascading Style Sheets) codes were first collected simultaneously; components were then detected via region-based CNN deep learning methods, and CNN and LSTM were combined to generate the GUI code. For the same task, Han et al. [7] utilized attention mechanisms in addition to object detection to obtain the web page's CSS style content, such as the color or the character style of the components; they also used CNN and LSTM to produce HTML code for a web page image input. Abolafia et al. [8] proposed a novel methodology for program synthesis using Priority Queue Training, an optimization algorithm: an RNN (Recurrent Neural Network) was trained on a data set, new codes were synthesized and added to the priority queue, and the queue was updated by sampling new codes from the RNN as iteration continued. In our previous work, Asiroglu et al. [9] proposed an automatic HTML code generation system from hand-drawn mock-up images: the mock-ups were first processed using computer vision techniques such as dilation and erosion morphological operations, then a CNN deep learning model was implemented to generate the relevant HTML code. For the same task, Moran et al. [10] developed a system for Android and iPhone called ReDraw, comprising detection, classification and assembly: the GUI components were detected from the mock-ups using image processing methods, classified into domain-specific types using software repository mining, automated dynamic analysis and deep convolutional neural networks, and finally the GUI hierarchy, from which a prototype application can be assembled, was generated automatically using the K-nearest-neighbors algorithm.

In this study, we propose a deep learning based system for automatic GUI code generation of web pages. The data related to the coordinate, width, height and type of the components on approximately 7,300 web page GUI images are extracted to create the dataset. These data compose the feature matrix, which is used as the output of a CNN deep learning architecture. The proposed system is able to generate intermediate DSL (Domain Specific Language) code for the input mock-up images. The DSL code is a generic, XML-like code that provides the types, locations, content and hierarchy of the GUI components extracted from the image.

II. MATERIALS AND METHODS

A. Generated Dataset

The complete dataset consists of 3026 unique web page templates captured as screenshots from open-source web coding platforms. All images were annotated with bounding boxes and class labels by analyzing the location, component type and hierarchical information in the HTML code. HTML code can be represented by a DSL (Domain-Specific Language) consisting of a string of pseudo-HTML labels, so the learning model is used to encode webpages as DSL, which can later be transformed into HTML code. The data were further randomly split into 2723 training and 303 test images. The training images include approximately 30,000 UI elements belonging to 7 classes. A more detailed class distribution is given in Fig. 1.

Fig. 1. Class distributions of the generated dataset

B. Proposed Model

There are several deep learning models for object detection tasks. In this study, we developed a model based on the Xception architecture [11], a deep convolutional neural network that uses depthwise separable convolutions. Xception networks have been proposed to overcome the high computational cost of the large number of parameters involved in convolving 3-channel pictures. By multiplying these pictures with various filters, the aim is to reveal patterns between channels as well as patterns between the height and width features.
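To see why depthwise separable convolutions reduce cost, one can compare weight counts for a standard convolution and its depthwise separable factorization. The sketch below is illustrative (biases ignored); the 3x3 kernel and 728-channel width are taken from the Xception feature maps used elsewhere in the paper, not from a stated parameter budget.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depthwise separable: one k x k filter per input channel,
    followed by a 1x1 pointwise convolution that mixes channels."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# Illustrative comparison at Xception's 728-channel width.
std = conv_params(3, 728, 728)            # 4,769,856 weights
sep = separable_conv_params(3, 728, 728)  # 536,536 weights
print(std, sep)                           # roughly a 9x reduction
```

The saving grows with the channel count, which is why the factorization pays off most in the deep, wide stages of the network.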

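The paper does not give the concrete grammar of its intermediate DSL; as a rough illustration only, an XML-like fragment encoding type, location, size and hierarchy for a few detected components might look like the following (all tag and attribute names here are hypothetical):

```xml
<!-- Hypothetical DSL fragment; tag and attribute names are illustrative -->
<page>
  <form x="120" y="80" width="400" height="260">
    <textbox x="140" y="120" width="360" height="32"/>
    <button x="140" y="220" width="120" height="40"/>
  </form>
</page>
```

A fragment like this carries exactly the information the system predicts (component type, position, size, nesting) and can be mechanically translated to HTML in a later step.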
Our deep learning model takes 256x256x3 images as inputs. Feature extraction is performed by applying the Xception architecture to these inputs. The tensor output by the Xception model, of size 16x16x728, is then given as input to the convolution layers. After each convolution layer, batch normalization is performed and the ReLU activation function is applied. The output of the first convolution layer (an 8x8x256 tensor) is fed into the triple convolution block as input.

In the model there are 4 blocks, each formed by three convolution layers. Residual connections are made by adding the first convolution layer in each block to the last convolution layer. The flow chart of our model is shown in Fig. 2.

A detailed description of our model and its steps is as follows:

Step 1: Max Pooling, Self Attention and Transpose Convolution layers are used, respectively, after Block 1. The output of the Transpose Convolution layer is a 28x28x256 tensor, which is the input of Block 2.

Step 2: A Max Pooling layer followed by a Self Attention layer is used after Block 2. The output of the Transpose Convolution layer is a 14x14x1024 tensor, which is the input of Block 3.

Step 3: A Self Attention layer is used after Block 3. The output of the Transpose Convolution layer is a 14x14x512 tensor, which is the input of Block 4.

Step 4: A Self Attention layer is used after Block 4. The output of the Transpose Convolution layer is a 14x14x256 tensor.

Step 5: In this step three convolution layers are used in a row. The outputs of these layers are 11x11x128, 8x8x64 and 5x5x32 dimensional tensors, respectively, and are given to a Fully Connected layer.

Step 6: A 5x5x12 tensor is obtained with the ReLU activation function used in this layer.

Step 7: After this stage, a Dropout layer is used to prevent the system from overfitting.

Step 8: Then another Fully Connected layer with a ReLU activation function is used. The value obtained from this layer constitutes the Intermediate Output of the system. This value is then split into 3 outputs, of sizes 5x5x1, 5x5x4 and 5x5x7, respectively.

Step 9: Objectness (whether an object exists or not) is calculated by applying a Sigmoid activation function to the first output (5x5x1), the bounding box features (coordinates, height, width) are calculated by applying a Sigmoid activation function to the second output (5x5x4), and the classification (one-hot encoded class) output is calculated by applying a Softmax activation function to the third output (5x5x7).

III. RESULTS

We applied the proposed deep learning model to our data set, in which 5,000 and 2,325 images are available for the training and test phases, respectively. The precision and recall performance metrics are utilized to obtain the performance evaluation results. The precision metric is calculated by dividing the true positive components by all positive predictions; recall is calculated by dividing the true positive predictions by the number of actual positives. (1) is used for the precision metric p and (2) for the recall r [12]. TP, FP and FN represent the number of true positives, false positives and false negatives, respectively.

p = TP / (TP + FP)    (1)

r = TP / (TP + FN)    (2)

In this study we used the average precision (AP) metric, given in (3), instead of the single-value precision and recall metrics, to summarize the precision-recall curve. AP is the weighted mean of the precisions achieved at each threshold, where p_i and r_i are the precision and recall values at the i-th threshold and Δr_i is the difference in recall from threshold i−1 to i. The ROC and Precision-Recall curves can be seen in Fig. 3 and Fig. 4, respectively.

AP = Σ_{i=0}^{n} p_i Δr_i    (3)

In Table I, AP, the area under the precision-recall curve (AUC-PRC) and the area under the ROC curve (AUC-ROC) are given for each component. According to these results, the a-href and textbox components achieve the highest scores among all components.

TABLE I
AP, AUC-PRC AND AUC-ROC VALUES FOR THE COMPONENTS

Component | AP    | AUC-PRC | AUC-ROC
form      | 0.336 | 0.350   | 0.691
textbox   | 0.697 | 0.717   | 0.846
text      | 0.390 | 0.425   | 0.590
title     | 0.500 | 0.538   | 0.689
checkbox  | 0.314 | 0.443   | 0.704
a-href    | 0.727 | 0.810   | 0.741
button    | 0.126 | 0.137   | 0.605
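As a small self-contained illustration of Eqs. (1)-(3), the following sketch computes precision, recall, and AP as the weighted mean of precisions over recall increments; the counts and threshold values in the usage lines are made up for illustration.

```python
import numpy as np

def precision(tp, fp):
    """Eq. (1): true positives over all positive predictions."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (2): true positives over all actual positives."""
    return tp / (tp + fn)

def average_precision(precisions, recalls):
    """Eq. (3): AP = sum_i p_i * delta_r_i, taking recall as 0 before
    the first threshold. Assumes recalls are sorted in increasing order."""
    p = np.asarray(precisions, dtype=float)
    r = np.asarray(recalls, dtype=float)
    delta_r = np.diff(np.concatenate(([0.0], r)))
    return float(np.sum(p * delta_r))

# Made-up counts at two illustrative thresholds.
print(precision(tp=3, fp=1))                      # 0.75
print(recall(tp=3, fn=1))                         # 0.75
print(average_precision([1.0, 0.5], [0.5, 1.0]))  # 1.0*0.5 + 0.5*0.5 = 0.75
```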

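Step 9's three-way output split can be sketched in NumPy as below. The channel ordering (1 objectness + 4 box + 7 class channels within the 5x5x12 intermediate tensor) is an assumption made for illustration; the paper specifies only the three output shapes and their activation functions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_head(t):
    """Split a 5x5x12 intermediate tensor into Step 9's three outputs.
    The channel layout (obj | box | class) is assumed, not given in the paper."""
    assert t.shape == (5, 5, 12)
    objectness = sigmoid(t[..., 0:1])   # 5x5x1, object present or not
    box = sigmoid(t[..., 1:5])          # 5x5x4, normalized coordinates, width, height
    classes = softmax(t[..., 5:12])     # 5x5x7, probabilities over 7 component types
    return objectness, box, classes

obj, box, cls = split_head(np.zeros((5, 5, 12)))
print(obj.shape, box.shape, cls.shape)  # (5, 5, 1) (5, 5, 4) (5, 5, 7)
```

The sigmoid keeps objectness and box values in (0, 1), matching normalized image coordinates, while the softmax makes each grid cell's class scores a proper distribution.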
Fig. 2. Proposed model

created a database that contains the coordinate, width, height and type of the components (form, textbox, text, title, checkbox, a-href, button) on approximately 3000 web page GUI images. The AP, AUC-PRC and AUC-ROC metrics were used in the performance analysis to demonstrate the effectiveness of our proposed system. In the future, we plan to develop our system to generate HTML code from the given GUI image.

Fig. 3. ROC Curve

Fig. 4. Precision-Recall Curve

V. ACKNOWLEDGMENTS

This study has been supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK)'s TEYDEB-1505 program, via grant number 5200001.

REFERENCES

[1] A. Halbe and A. R. Joshi, "A Novel Approach to HTML Page Creation Using Neural Network," Procedia Computer Science, vol. 45, pp. 197–204, 2015. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1877050915003580
[2] T. Beltramelli, "pix2code: Generating Code from a Graphical User Interface Screenshot," in Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems (EICS '18), 2018, pp. 1–6. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3220134.3220135
[3] V. Jain, P. Agrawal, S. Banga, R. Kapoor, and S. Gulyani, "Sketch2Code: Transformation of Sketches to UI in Real-time Using Deep Neural Network," arXiv preprint, 2019. [Online]. Available: http://arxiv.org/abs/1910.08930
[4] X. Pang, Y. Zhou, P. Li, W. Lin, W. Wu, and J. Z. Wang, "A novel syntax-aware automatic graphics code generation with attention-based deep neural network," Journal of Network and Computer Applications, vol. 161, p. 102636, 2020. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1084804520301107
[5] D. Pawade, A. Sakhapara, S. Parab, D. Raikar, R. Bhojane, and H. Mamania, "Automatic HTML Code Generation from Graphical User Interface Image," in 2018 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT). IEEE, 2018, pp. 749–753. [Online]. Available: https://ieeexplore.ieee.org/document/9012284/

[6] Y. Xu, L. Bo, X. Sun, B. Li, J. Jiang, and W. Zhou, “image2emmet:
Automatic code generation from web user interface image,” Journal
of Software: Evolution and Process, vol. 33, no. 8, 8 2021. [Online].
Available: https://onlinelibrary.wiley.com/doi/10.1002/smr.2369
[7] Y. Han, J. He, and Q. Dong, “CSSSketch2Code: An automatic method
to generate web pages with CSS style,” ACM International Conference
Proceeding Series, pp. 29–35, 2018.
[8] D. A. Abolafia, M. Norouzi, J. Shen, R. Zhao, and Q. V. Le, “Neural
Program Synthesis with Priority Queue Training,” arXiv Preprint, 1
2018. [Online]. Available: http://arxiv.org/abs/1801.03526
[9] B. Asiroglu, B. R. Mete, E. Yildiz, Y. Nalcakan, A. Sezen,
M. Dagtekin, and T. Ensari, “Automatic HTML Code Generation
from Mock-Up Images Using Machine Learning Techniques,” in 2019
Scientific Meeting on Electrical-Electronics & Biomedical Engineering
and Computer Science (EBBT). IEEE, 4 2019, pp. 1–4. [Online].
Available: https://ieeexplore.ieee.org/document/8741736/
[10] K. Moran, C. Bernal-Cardenas, M. Curcio, R. Bonett, and
D. Poshyvanyk, “Machine Learning-Based Prototyping of Graphical
User Interfaces for Mobile Apps,” IEEE Transactions on Software
Engineering, vol. 46, no. 2, pp. 196–221, 2 2020. [Online]. Available:
https://ieeexplore.ieee.org/document/8374985/
[11] F. Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 1800–1807. [Online]. Available: http://ieeexplore.ieee.org/document/8099678/
[12] M. Zhu, “Recall, precision and average precision,” 2004.
