
pix2code: Generating Code from a Graphical User Interface Screenshot

Tony Beltramelli
UIzard Technologies
Copenhagen, Denmark
[email protected]
arXiv:1705.07962v1 [cs.LG] 22 May 2017

Abstract
Transforming a graphical user interface screenshot created by a designer into computer code is a typical task conducted by a developer in order to build customized software, websites, and mobile applications. In this paper, we show that Deep Learning techniques can be leveraged to automatically generate code given a graphical user interface screenshot as input. Our model is able to generate code targeting three different platforms (i.e. iOS, Android and web-based technologies) from a single input image with over 77% accuracy.

1 Introduction
The process of implementing client-side software based on a Graphical User Interface (GUI) mockup created by a designer is the responsibility of developers. Implementing GUI code is, however, time-consuming and prevents developers from dedicating the majority of their time to implementing the actual features and logic of the software they are building. Moreover, the computer languages used to implement such GUIs are specific to each target platform, resulting in tedious and repetitive work when the software being built is expected to run on multiple platforms using native technologies. In this paper, we describe a system that can automatically generate platform-specific computer code given a GUI screenshot as input. We extrapolate that a scaled version of our method could potentially end the need for manually-programmed GUIs.
Our first contribution is pix2code, a novel approach based on Convolutional and Recurrent Neural
Networks allowing the generation of computer code from a single GUI screenshot as input. Our
model is able to generate computer code from the pixel values of the input image alone. That is, no
engineered feature extraction pipeline is designed to pre-process the input data. Our experiments
demonstrate the effectiveness of our method for generating computer code for various platforms (i.e.
iOS and Android native mobile interfaces, and multi-platform web-based HTML/CSS interfaces)
without the need for any change or specific tuning to the model. In fact, pix2code can be used as is to generate code written in different target languages simply by being trained on a different dataset. A video demonstrating our system is available online^1.
Our second contribution is the release of our synthesized datasets consisting of both GUI screenshots
and associated source code for three different platforms. They will be made freely available^2 upon
publication of this paper to foster future research.

2 Related Work
The automatic generation of programs using machine learning techniques is a relatively new field, and program synthesis in a human-readable format has only been addressed very recently. A recent
example is DeepCoder [2], a system able to generate computer programs by leveraging statistical
^1 https://uizard.io/research#pix2code
^2 https://github.com/tonybeltramelli/pix2code

Submitted to 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
predictions to augment traditional search techniques. In another work by Gaunt et al. [4], the
generation of source code is enabled by learning the relationships between input-output examples via
differentiable interpreters. Furthermore, Ling et al. [11] recently demonstrated program synthesis
from a mixed natural language and structured program specification as input. It is important to note
that most of these methods rely on Domain Specific Languages (DSLs); computer languages (e.g.
markup languages, programming languages, modeling languages) that are designed for a specialized
domain but are typically more restrictive than full-featured computer languages. Using a DSL thus limits the complexity of the language that needs to be modeled and reduces the size of the search space.
Although the generation of computer programs is an active research field as suggested by these
breakthroughs, code generation from visual inputs is still an unexplored research area. This paper is,
to the best of our knowledge, the first work attempting to address this very problem. In order to exploit
the graphical nature of our input, we can borrow methods from the computer vision literature. In fact, a substantial body of research [19, 18, 3, 9] has shown that deep neural networks are able to learn latent variables describing objects in an image and generate a variable-length textual description
of the objects and their relationships. All these methods rely on two main components. First, a
Convolutional Neural Network (CNN) transforming the raw input image into an intermediary learned
representation. Second, a Recurrent Neural Network (RNN) performing language modeling on the
textual description associated with the input picture. Combining both neural network architectures
allows the generation of image captions with impressive results.

3 pix2code
The task of generating code given a GUI screenshot as input can be compared to the task of generating
English textual descriptions given a scene photograph as input. We can thus divide our problem into
three sub-problems. First, a computer vision problem of understanding the given scene (i.e. in this
case, the GUI screenshot image) and inferring the objects present, their identities, positions, and poses
(i.e. buttons, labels, element containers). Second, a language modeling problem of understanding
text (i.e. in this case, computer code) and generating syntactically and semantically correct samples.
Finally, the last challenge is to use the solutions to both previous sub-problems by exploiting the
latent variables inferred from scene understanding to generate corresponding textual descriptions (i.e.
computer code rather than English text) of the objects represented by these variables.

Figure 1: Overview of the pix2code model architecture: (a) training, (b) sampling. During training, the GUI screenshot picture
is encoded by a CNN-based vision model; the sequence of one-hot encoded tokens corresponding to
DSL code is encoded by a language model consisting of a stack of LSTM layers. The two resulting
encoded vectors are then concatenated and fed into a second stack of LSTM layers acting as a decoder.
Finally, a softmax layer is used to sample one token at a time; the output size of the softmax layer
corresponding to the DSL vocabulary size. Given an image and a sequence of tokens, the model (i.e.
contained in the gray box) is differentiable and can thus be optimized end-to-end through gradient
descent to predict the next token in the sequence. The input state (i.e. a sequence of tokens) is updated
at each prediction to contain the last predicted token. During sampling, the generated DSL code is
compiled to the desired target language using traditional compiler design techniques.

3.1 Vision Model


CNNs are currently the method of choice to solve a wide range of vision problems thanks to their
topology allowing them to learn rich latent representations from the images they are trained on
[14, 10]. We used a CNN to perform unsupervised feature learning by mapping an input image to a
learned fixed-length vector; thus acting as an encoder as shown in Figure 1.

The input images are initially re-sized to 256 × 256 pixels (the aspect ratio is not preserved) and the pixel values are normalized before being fed into the CNN. No further pre-processing is performed. To encode each input image to a fixed-size output vector, we exclusively used small 3 × 3 receptive fields convolved with stride 1, as used by Simonyan and Zisserman for VGGNet [15]. These operations are applied twice before down-sampling with max-pooling. The width of the first convolutional layer is 32, followed by a layer of width 64, and finally width 128. Two fully connected layers of size 1024 applying the rectified linear unit activation complete the vision model.
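
For illustration, a minimal Keras sketch of a vision encoder matching this description is given below. The three-channel input, the default padding, and the dropout placement (taken from Section 3.4) are assumptions; this is a sketch, not the authors' implementation.

    # Minimal sketch of the vision encoder described above (Keras); not the
    # authors' implementation. Input channels and padding are assumptions.
    from tensorflow.keras import layers, models

    def build_vision_model(input_shape=(256, 256, 3)):
        image = layers.Input(shape=input_shape)
        x = image
        for width in (32, 64, 128):
            # Two 3x3 stride-1 convolutions, then max-pooling, as described above.
            x = layers.Conv2D(width, (3, 3), strides=1, activation="relu")(x)
            x = layers.Conv2D(width, (3, 3), strides=1, activation="relu")(x)
            x = layers.MaxPooling2D(pool_size=(2, 2))(x)
            x = layers.Dropout(0.25)(x)  # 25% dropout after max-pooling (Section 3.4)
        x = layers.Flatten()(x)
        x = layers.Dense(1024, activation="relu")(x)
        x = layers.Dropout(0.3)(x)       # 30% dropout after each fully connected layer
        x = layers.Dense(1024, activation="relu")(x)
        x = layers.Dropout(0.3)(x)
        return models.Model(image, x, name="vision_encoder")
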

Figure 2: An example of a native iOS GUI written in our DSL: (a) iOS GUI screenshot; (b) code describing the GUI in our DSL.

3.2 Language Model


We designed a simple DSL to describe GUIs as illustrated in Figure 2. In this work we are only interested in the GUI layout, the different graphical components, and their relationships; the actual textual value of the labels is therefore ignored by our DSL. In addition to reducing the size of the search space, the simplicity of the DSL also reduces the size of the vocabulary (i.e. the total number of tokens supported by the DSL). As a result, our language model can perform token-level language modeling with a discrete input by using one-hot encoded vectors, eliminating the need for word embedding techniques such as word2vec [12] that can result in costly computations.
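
To make the encoding concrete, the sketch below shows token-level one-hot encoding for a toy vocabulary; the token names are placeholders and not the actual DSL vocabulary.

    # Token-level one-hot encoding for a toy vocabulary; the tokens below are
    # placeholders, not the actual pix2code DSL vocabulary.
    import numpy as np

    vocabulary = ["<START>", "<END>", "stack", "row", "btn", "label", "{", "}"]
    token_to_id = {tok: i for i, tok in enumerate(vocabulary)}

    def one_hot(token):
        vec = np.zeros(len(vocabulary), dtype=np.float32)
        vec[token_to_id[token]] = 1.0
        return vec

    sequence = ["<START>", "stack", "{", "btn", "}", "<END>"]
    encoded = np.stack([one_hot(t) for t in sequence])  # (sequence length, vocabulary size)
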
In most programming and markup languages, an element is declared with an opening token; if child elements or instructions are contained within a block, a closing token is usually needed for the interpreter or the compiler. In such a scenario, where the number of child elements contained in a parent element is variable, it is important to model long-term dependencies in order to close a block that has been opened. Traditional RNN architectures suffer from vanishing and exploding gradients, preventing them from modeling such relationships between data points spread out in time series (i.e. in this case, tokens spread out in a sequence). Hochreiter and
Schmidhuber [8] proposed the Long Short-Term Memory (LSTM) neural architecture in order to
address this very problem. The different LSTM gate outputs can be computed as follows:

i_t = \phi(W_{ix} x_t + W_{iy} h_{t-1} + b_i)    (1)
f_t = \phi(W_{fx} x_t + W_{fy} h_{t-1} + b_f)    (2)
o_t = \phi(W_{ox} x_t + W_{oy} h_{t-1} + b_o)    (3)
c_t = f_t \bullet c_{t-1} + i_t \bullet \sigma(W_{cx} x_t + W_{cy} h_{t-1} + b_c)    (4)
h_t = o_t \bullet \sigma(c_t)    (5)

With W the matrices of weights, x_t the new input vector at time t, h_{t-1} the previously produced output vector, c_{t-1} the previously produced cell state output, b the biases, and φ and σ the activation functions sigmoid and hyperbolic tangent, respectively. The cell state c learns to memorize
information by using a recursive connection as done in traditional RNN cells. The input gate i is
used to control the error flow on the inputs of cell state c to avoid the input weight conflicts that occur in traditional RNNs because the same weight has to be used both for storing certain inputs and for ignoring others. The output gate o controls the error flow from the outputs of cell state c to prevent the output weight conflicts that occur in standard RNNs because the same weight has to be used both for retrieving some information and for not retrieving other information. The LSTM memory block can thus use i to decide
when to write information in c and use o to decide when to read information from c. We used the
LSTM variant proposed by Gers and Schmidhuber [5] with a forget gate f to reset memory and help
the network model continuous sequences.
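
For reference, a direct numpy transcription of Equations (1)-(5) is sketched below, keeping the paper's convention that φ denotes the sigmoid and σ the hyperbolic tangent; the weight and bias containers are assumed dictionaries of arrays.

    # Direct numpy transcription of the LSTM gate equations (1)-(5) above.
    # Convention from the text: phi = sigmoid, sigma = tanh. The containers
    # W and b are assumed to be dictionaries of numpy arrays.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        i_t = sigmoid(W["ix"] @ x_t + W["iy"] @ h_prev + b["i"])   # Eq. (1)
        f_t = sigmoid(W["fx"] @ x_t + W["fy"] @ h_prev + b["f"])   # Eq. (2)
        o_t = sigmoid(W["ox"] @ x_t + W["oy"] @ h_prev + b["o"])   # Eq. (3)
        c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["cy"] @ h_prev + b["c"])  # Eq. (4)
        h_t = o_t * np.tanh(c_t)                                   # Eq. (5)
        return h_t, c_t
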

3.3 Combining Models

Our model is trained in a supervised learning manner by feeding an image I and a sequence X of T tokens x_t, t ∈ {0, ..., T − 1}, as inputs, with the token x_T as the target label. As shown in Figure 1, a CNN-based vision model encodes the input image I into a vectorial representation p. The input token x_t is encoded by an LSTM-based language model into an intermediary representation q_t, allowing the model to focus more on certain tokens and less on others [7]. This first language model is implemented as a stack of two LSTM layers with 128 cells each. The vision-encoded vector p and the language-encoded vector q_t are concatenated into a single vector r_t, which is then fed into a second LSTM-based model that decodes the representations learned by both the vision model and the language model. The decoder thus learns to model the relationship between the objects present in the input GUI image and the associated tokens present in the DSL code. Our decoder is implemented as a stack of two LSTM layers with 512 cells each. This architecture can be expressed mathematically as follows:

p = CNN(I)    (6)
q_t = LSTM(x_t)    (7)
r_t = (q_t, p)    (8)
y_t = softmax(LSTM'(r_t))    (9)
x_{t+1} = y_t    (10)

This architecture allows the whole pix2code model to be optimized end-to-end with gradient descent to predict one token at a time after it has seen both the image and the preceding tokens in the sequence. The discrete nature of the output (i.e. the fixed-size vocabulary of tokens in the DSL) allows us to reduce the task to a classification problem. That is, the output layer of our model has the same number of cells as the vocabulary size, generating a probability distribution over the candidate tokens at each time step and allowing the use of a softmax layer to perform multi-class classification.
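
A hedged Keras sketch of this wiring is shown below. The sequence length, layer widths, and the vision encoder follow the text, while details such as the RepeatVector used to align shapes are assumptions rather than the authors' exact implementation.

    # Sketch of the combined architecture: vision encoder + 2x128 LSTM language
    # encoder, concatenation, 2x512 LSTM decoder, softmax over the vocabulary.
    # RepeatVector and other wiring details are assumptions.
    from tensorflow.keras import layers, models

    def build_pix2code_model(vision_model, vocab_size, sequence_length=48):
        image = layers.Input(shape=(256, 256, 3))
        p = vision_model(image)                               # image encoding p
        p_seq = layers.RepeatVector(sequence_length)(p)       # repeat p for each time step

        tokens = layers.Input(shape=(sequence_length, vocab_size))  # one-hot tokens x_t
        q = layers.LSTM(128, return_sequences=True)(tokens)
        q = layers.LSTM(128, return_sequences=True)(q)        # language encoding q_t

        r = layers.concatenate([q, p_seq])                    # r_t = (q_t, p)
        d = layers.LSTM(512, return_sequences=True)(r)
        d = layers.LSTM(512, return_sequences=False)(d)
        y = layers.Dense(vocab_size, activation="softmax")(d) # distribution over next token
        return models.Model(inputs=[image, tokens], outputs=y)
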

3.4 Training

The length T of the sequences used for training is important to model long-term dependencies;
for example to be able to close a block of code that has been opened. After conducting empirical
experiments, the DSL code input file used for training was segmented with a sliding window of
size 48; in other words, we unroll the recurrent neural network for 48 steps. This was found to be a
satisfactory trade-off between long-term dependencies learning and computational cost. For every
token in the input DSL file, the model is therefore fed with both an input image and a sequence of
T = 48 tokens. While the sequence of tokens used for training is updated at each time step (i.e.
each token) by sliding the window, the very same input image I is reused for samples associated
with the same GUI. The special tokens <START> and <END> are used to respectively prefix and
suffix the DSL code files similarly to the method used by Karpathy et al. [9]. Training is performed
by computing the partial derivatives of the loss with respect to the network weights calculated with
backpropagation to minimize the multiclass log loss:

L(I, X) = -\sum_{t=1}^{T} x_{t+1} \log(y_t)    (11)

With x_{t+1} the expected token and y_t the predicted token. The model is optimized end-to-end, hence the loss L is minimized with regard to all the parameters, including all layers in the CNN-based vision model and all layers in both LSTM-based models. Training with the RMSProp algorithm [17] gave the best results, with a learning rate set to 1e-4 and the output gradient clipped to the range [−1.0, 1.0] to cope with numerical instability [7]. To prevent overfitting, dropout regularization [16] is applied to the vision model at 25% after each max-pooling operation and at 30% after each fully connected layer. In the LSTM-based models, dropout is set to 10% and only applied to the non-recurrent connections [21]. Our model was trained with mini-batches of 64 image-sequence pairs.
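
The sliding-window sample construction and the optimizer settings above could be set up roughly as follows; the zero left-padding of the first context windows and the helper structure are assumptions, not the authors' code.

    # Rough setup of the training scheme described above: sliding window of
    # T = 48 tokens, RMSProp with learning rate 1e-4, gradients clipped to
    # [-1.0, 1.0], multiclass log loss, mini-batches of 64. The zero-padding
    # of the first context windows is an assumption.
    import numpy as np
    from tensorflow.keras.optimizers import RMSprop

    T = 48

    def sliding_window_samples(image, token_ids, vocab_size):
        """Yield (image, T-token context, next-token target) training samples."""
        padded = [0] * T + token_ids
        for t in range(len(token_ids)):
            context = np.eye(vocab_size)[padded[t:t + T]]   # one-hot context window
            target = np.eye(vocab_size)[token_ids[t]]       # one-hot expected token
            yield image, context, target

    optimizer = RMSprop(learning_rate=1e-4, clipvalue=1.0)
    # model.compile(loss="categorical_crossentropy", optimizer=optimizer)
    # model.fit([images, contexts], targets, batch_size=64)
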

3.5 Sampling
To generate DSL code, we feed the GUI image I and a sequence X of T = 48 tokens, where tokens x_t ... x_{T-1} are initially set empty and the last token of the sequence x_T is set to the special <START> token. The predicted token y_t is then used to update the next sequence of input tokens. That is, x_t ... x_{T-1} are set to x_{t+1} ... x_T (x_t is thus discarded), with x_T set to y_t. The process is repeated until the token <END> is generated by the model. The generated DSL code can then be compiled with traditional compilation methods to the desired target language.
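
A sketch of this greedy sampling loop is given below; `model`, `vocabulary`, and `one_hot` are assumed helpers (as in the earlier sketches), and the maximum length cutoff is an added safeguard.

    # Greedy sampling sketch: start from empty slots ending with <START>,
    # predict the next token, slide the window, stop at <END>. The helpers
    # `model`, `vocabulary`, and `one_hot` are assumptions from earlier sketches.
    import numpy as np

    def sample_greedy(model, image, vocabulary, one_hot, T=48, max_len=150):
        empty = np.zeros(len(vocabulary), dtype=np.float32)
        context = [empty] * (T - 1) + [one_hot("<START>")]
        generated = []
        for _ in range(max_len):
            probs = model.predict([image[None, ...], np.array(context)[None, ...]], verbose=0)[0]
            token = vocabulary[int(np.argmax(probs))]
            if token == "<END>":
                break
            generated.append(token)
            context = context[1:] + [one_hot(token)]   # x_t is discarded, y_t appended
        return generated
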

Dataset type                 Synthesizable    Training set                Test set
                                              Instances    Samples        Instances    Samples
iOS UI (Storyboard)          26 × 10^5        1500         93672          250          15984
Android UI (XML)             21 × 10^6        1500         85756          250          14265
web-based UI (HTML/CSS)      31 × 10^4        1500         143850         250          24108

Table 1: Dataset statistics. The column Synthesizable refers to the maximum number of unique GUI configurations that can be synthesized using our data synthesis algorithm. The columns Instances refer to the number of synthesized (GUI screenshot, GUI code) file pairs. The columns Samples refer to the number of distinct image-sequence pairs. In fact, training and sampling are done one token at a time by feeding the model with an image and a sequence of tokens obtained with a sliding window of fixed size T. The total number of training samples thus depends on the total number of tokens written in the DSL code files and on the size of the sliding window, which we set to T = 48.

4 Experiments
Access to large datasets is a typical bottleneck when training deep neural networks. To the best of our knowledge, no dataset consisting of both GUI screenshots and source code was available at the time this paper was written. As a consequence, we synthesized our own data, resulting in the three datasets described in Table 1, which will be open-sourced. Our data synthesis algorithm is designed to synthesize GUIs written in our DSL, which is then compiled to the desired target language to be rendered. Using data synthesis also allows us to demonstrate the capability of our model to generate
computer code for three different platforms.

Figure 3: (a) Training loss on the different datasets and (b) micro-average ROC curves calculated during sampling with the model trained for 10 epochs.

Our model has around 109 × 10^6 parameters to optimize, and all experiments are performed with the same model with no specific tuning; only the training datasets differ, as shown in Figure 3. Code generation is performed with both greedy search and beam search to find the tokens that maximize the classification probability. To evaluate the quality of the generated output, the classification error is computed for each sampled DSL token and averaged over the whole test dataset. The length difference between the generated and the expected token sequences is also counted as error. The results can be seen in Table 2.

Dataset type                 Error (%)
                             Greedy search    Beam search 3    Beam search 5
iOS UI (Storyboard)          22.73            25.22            23.94
Android UI (XML)             22.34            23.58            40.93
web-based UI (HTML/CSS)      12.14            11.01            22.35

Table 2: Experiment results reported for the test sets described in Table 1.
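
For context, greedy search keeps only the single most likely token at each step, whereas beam search keeps the k highest-scoring partial sequences. The generic sketch below illustrates the idea under these assumptions; it is not the authors' implementation.

    # Generic beam search sketch (not the authors' implementation): keep the k
    # partial sequences with the highest cumulative log-probability.
    # `step_fn(context)` is an assumed callback returning a probability
    # distribution over the next token id given the current context window.
    import numpy as np

    def beam_search(step_fn, start_context, end_id, k=3, max_len=150):
        beams = [(0.0, start_context, [])]      # (log-prob, context, generated tokens)
        for _ in range(max_len):
            candidates = []
            for score, context, tokens in beams:
                if tokens and tokens[-1] == end_id:
                    candidates.append((score, context, tokens))  # finished beam carries over
                    continue
                probs = step_fn(context)
                for token_id in np.argsort(probs)[-k:]:          # k best continuations
                    new_score = score + float(np.log(probs[token_id] + 1e-12))
                    candidates.append((new_score, context[1:] + [token_id], tokens + [token_id]))
            beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:k]
            if all(t and t[-1] == end_id for _, _, t in beams):
                break
        return beams[0][2]                      # tokens of the best-scoring beam
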

Figures 4, 5, and 6 show samples consisting of input GUIs (i.e. ground truth) and output GUIs generated by a trained pix2code model. It is important to remember that the actual textual value of the labels is ignored and that both our data synthesis algorithm and our DSL compiler assign randomly generated text to the labels. Despite occasional problems in selecting the right color or the right style for specific GUI elements, and some difficulty modeling GUIs consisting of long lists of graphical components, our model is generally able to learn the GUI layout in a satisfying manner and can preserve the hierarchical structure of the graphical elements.

(a) Groundtruth GUI 1 (b) Generated GUI 1 (c) Groundtruth GUI 2 (d) Generated GUI 2
Figure 4: Experiment samples for the iOS GUI dataset.

5 Conclusion and Discussions


In this paper, we presented pix2code, a novel method to generate computer code given a single GUI
screenshot as input. While our work demonstrates the potential of such a system to automate GUI
programming, we only scratched the surface of what is possible. Our model consists of relatively few
parameters and was trained on a relatively small dataset. The quality of the generated code could be
drastically improved by training a bigger model on significantly more data for an extended number of
epochs. Experimenting with various regularization methods or implementing an attention mechanism
[1] could further improve the quality of the generated code.
Using one-hot encoding as we did does not provide any useful information about the relationships between the tokens, since the method simply assigns an arbitrary vector representation to each token. Therefore, pre-training the language model to learn vectorial representations of the tokens would allow the relationships between tokens in the DSL to be inferred (i.e. learning word embeddings such as word2vec [12]) and, as a result, alleviate semantic errors in the generated code. Furthermore, one-hot encoding does not scale to very large vocabularies and thus restricts the number of symbols that the DSL can support.
Generative Adversarial Networks (GANs) [6] have been shown to be extremely powerful at generating
images and sequences [20, 13, 22]. Applying such techniques to the problem of generating computer
code from an input image is so far an unexplored research area. GANs could potentially be used as a
standalone method to generate code or could be used in combination with our pix2code model to
fine-tune results.
A major drawback of deep neural networks is the need for a lot of training data for the resulting
model to generalize well on new unseen examples. One of the significant advantages of the method
we described in this paper is that there is no need for human-labelled data. In fact, the network
can model the relationships between graphical components and associated code by simply being
trained on image-sequence pairs. Although we used data synthesis in our paper partly to demonstrate the capability of our method to generate GUI code for various platforms, data synthesis might not be needed at all if one wants to focus only on web-based GUIs. In fact, one could imagine crawling the World Wide Web to collect a dataset of HTML/CSS code associated with screenshots of rendered GUIs. Considering the large number of websites already available online and the fact that new websites are created every day, the web could theoretically supply an unlimited amount of training data. We extrapolate that Deep Learning used in this manner could eventually end the need for manually-programmed GUIs.

(a) Groundtruth GUI 3 (b) Generated GUI 3 (c) Groundtruth GUI 4 (d) Generated GUI 4
Figure 5: Experiment samples from the Android GUI dataset.

(a) Groundtruth GUI 5 (b) Generated GUI 5

(c) Groundtruth GUI 6 (d) Generated GUI 6


Figure 6: Experiment samples from the web-based GUI dataset.

References
[1] Bahdanau, D., Cho, K., and Bengio, Y.: 2014, arXiv preprint arXiv:1409.0473
[2] Balog, M., Gaunt, A. L., Brockschmidt, M., Nowozin, S., and Tarlow, D.: 2016, arXiv preprint
arXiv:1611.01989
[3] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K.,
and Darrell, T.: 2015, in Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 2625–2634
[4] Gaunt, A. L., Brockschmidt, M., Singh, R., Kushman, N., Kohli, P., Taylor, J., and Tarlow, D.:
2016, arXiv preprint arXiv:1608.04428
[5] Gers, F. A., Schmidhuber, J., and Cummins, F.: 2000, Neural computation 12(10), 2451
[6] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.,
and Bengio, Y.: 2014, in Advances in neural information processing systems, pp 2672–2680
[7] Graves, A.: 2013, arXiv preprint arXiv:1308.0850
[8] Hochreiter, S. and Schmidhuber, J.: 1997, Neural computation 9(8), 1735
[9] Karpathy, A. and Fei-Fei, L.: 2015, in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp 3128–3137
[10] Krizhevsky, A., Sutskever, I., and Hinton, G. E.: 2012, in Advances in neural information
processing systems, pp 1097–1105
[11] Ling, W., Grefenstette, E., Hermann, K. M., Kočiský, T., Senior, A., Wang, F., and Blunsom, P.:
2016, arXiv preprint arXiv:1603.06744
[12] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J.: 2013, in Advances in neural
information processing systems, pp 3111–3119
[13] Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H.: 2016, in Proceedings of
The 33rd International Conference on Machine Learning, Vol. 3
[14] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y.: 2013, arXiv
preprint arXiv:1312.6229
[15] Simonyan, K. and Zisserman, A.: 2014, arXiv preprint arXiv:1409.1556
[16] Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.: 2014,
Journal of Machine Learning Research 15(1), 1929
[17] Tieleman, T. and Hinton, G.: 2012, COURSERA: Neural networks for machine learning 4(2)
[18] Vinyals, O., Toshev, A., Bengio, S., and Erhan, D.: 2015, in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp 3156–3164
[19] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio,
Y.: 2015, in ICML, Vol. 14, pp 77–81
[20] Yu, L., Zhang, W., Wang, J., and Yu, Y.: 2016, arXiv preprint arXiv:1609.05473
[21] Zaremba, W., Sutskever, I., and Vinyals, O.: 2014, arXiv preprint arXiv:1409.2329
[22] Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., and Metaxas, D.: 2016, arXiv
preprint arXiv:1612.03242
