EgoCoder: Intelligent Program Synthesis with Hierarchical Sequential Neural Network Model
…practitioners in computer science and other related areas. To learn basic programing skills, a long-time systematic training is usually required for beginners. According to a recent market report, the computer software market is expected to continue expanding at an accelerating speed, but the market supply of qualified software developers can hardly meet such a huge demand. In recent years, the surge of text generation research provides an opportunity to address this dilemma through automatic program synthesis. In this paper, we propose to solve the program synthesis problem from a data mining perspective. To address the problem, a novel generative model, namely EgoCoder, will be introduced. EgoCoder parses program code into abstract syntax trees (ASTs), where the tree nodes contain the program code/comment content and the tree structure captures the program logic flows. Based on a new unit model called Hsu, EgoCoder can effectively capture both the hierarchical and sequential patterns in the program ASTs. Extensive experiments are conducted to compare EgoCoder with state-of-the-art text generation methods, and the experimental results demonstrate the effectiveness of EgoCoder in addressing the program synthesis problem.

KEYWORDS
Program Synthesis; Text Generation; Neural Networks; Data Mining

ACM Reference Format:
Jiawei Zhang, Limeng Cui, Fisher B. Gouza. 2018. EgoCoder: Intelligent Program Synthesis with Hierarchical Sequential Neural Network Model. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Formally, programing denotes the process of developing and implementing computer instructions to enable a computer to perform certain tasks. These instructions are usually written in one or several programing languages, and a sequence of computer instructions (implementing the pre-specified functions) is called a computer program, which helps the computer operate smoothly. To become a qualified programmer, people may need to master knowledge from various areas, including programing languages, discrete mathematics, data structures and algorithms.

Computer programing continues to be a necessary and important skill for both academic researchers and industry practitioners as Internet and AI applications continue to expand. As introduced in [2], the computer software market is expanding at an accelerating speed and is estimated to grow from 19.98 billion USD in 2014 to more than 50.34 billion USD in 2022. Meanwhile, according to the latest market analysis report [3], there exists a huge gap between the market supply and demand of software developers. For instance, from January 2016 to February 2017, more than 115,000 job postings requesting qualified software engineers were published each month, but the average monthly number of hires was merely 33,579. Such a huge demand-supply gap also motivates many large IT companies to seek other ways to address the problem.

For effective program code storage and maintenance, the well-known big IT and related technology companies maintain company-internal codebases for storing all the developed program code of company systems, web services, software products and research projects. The program code in these codebases is normally of a tremendous amount. A recent report [1] lists the lines of code used in several companies and software systems, among which Google ranks top with more than 2 billion lines of code [4] used across all its Internet services. These company codebase repositories cover very diverse yet high-quality code, which is also among the most valuable intellectual property of the companies, but they have not been effectively exploited.

Programing has long been treated, at least by untrained eyes, as one of the most challenging skills, mastered by only a small number of people. In this paper, we attack this holy-grail pride of software engineers by training a model to write programs automatically. The automatic program synthesis problem is a fundamental problem from the technology, business and societal development perspectives. Successfully addressing it will effectively bridge the market supply-demand gap for qualified practitioners, greatly stimulate the development of IT and other related areas, intelligently recycle company-internal codebases for secondary development, and promisingly free humans from tedious coding positions for more challenging jobs.

In recent years, due to the surge of deep learning developments [12], many text generation research works and models have been proposed, which introduce many novel yet interesting research problems. Meanwhile, slightly different from the unstructured sentences written in natural languages, the program code written in
programing languages is highly structured and can be precisely parsed into program components such as modules, classes, functions, statements and expressions. Therefore, instead of handling the program character by character (like the existing text generation research works [31]), new techniques that can handle the program according to its own structure are needed.

Figure 1: An Example of Program Abstract Syntax Tree. (The figure shows the source code of a quick-sort program on the left and its program hierarchy on the right, where Module, Class Definition, Function and Statement nodes are nested hierarchically and sibling nodes at each level are linked sequentially.)

The automatic program synthesis problem is extremely challenging to solve due to several reasons:
• Lack of Problem Definition: The automatic program synthesis problem is still an open problem in this context so far. A formal definition of automatic program synthesis is required before potential solutions can be proposed.
• Program Hierarchical Structure Extraction: There usually exists a concrete hierarchical-sequential structure in program code, following its logic flows both hierarchically and sequentially. Generally, code tokens at the lower levels of a program precisely implement the desired physical functions of the program components at higher levels; meanwhile, at each level, the logic flows in a sequential manner from the beginning to the end. Extracting such a hierarchical-sequential program structure will be useful for effective program information modeling and representation learning.
• Unit Model: Each component in the aforementioned hierarchical-sequential structure, depending on the specific running mode, accepts input from the components above/below and before/after it. A new unit model for implementing such intertwined relationships …
• … physical function of the program, e.g., ranking, shuffling, searching, factorization and dynamic programing, etc. Effectively incorporating the program intention into the learning process will allow both program generation and interpretation.

… sibling nodes at the same levels, as well as evolving information from the child nodes and inheriting information from the father node simultaneously. Based on a set of sampled sub-tree batches from the extracted program ASTs, EgoCoder can be trained effectively to capture the substructures covered in the ASTs. These new technical terms will be clearly illustrated in great detail in this paper.

2 PROBLEM FORMULATION
In this section, we will first define several important concepts used in this paper, based on which we will provide the formulation of the studied problem and its three different running modes.

2.1 Terminology Definition
A computer program usually has a highly structured hierarchy, involving code components belonging to different syntax types.

Definition 2.1. (Program Syntax Type): Formally, we can represent the set of syntax types involved in the program as set C = …

… into a program abstract syntax tree, where the nodes denote program code components (i.e., code blocks) belonging to different syntax types, and the links represent the semantic logic flows among the code components.
• Program Generation: Given the program intention of the top program module component in the program AST, the program generation problem aims at generating the program source code that implements the specified intentions.
• Program Interpretation: With the complete program source code or merely a fragment, the program interpretation problem aims at inferring the potential intention of the program, i.e., interpreting the physical functions of the program code.
• Program Completion: Given a fragment of the program code, which can be either a function or merely several statements, the program completion problem aims at completing the missing components of the program.

3 PROPOSED METHODS
In this section, we will introduce the EgoCoder framework to solve the automatic program synthesis problem (including all three aforementioned sub-problems). Framework EgoCoder involves several crucial steps: (1) program parsing, (2) hierarchical sequential statement encoding with Hsu, and (3) framework learning. In the following part of this section, we will introduce these three steps in great detail.

… relationships, and each statement further contains multiple sequential … of characters, i.e., 'p', 'i', 'v', 'o', · · · , 'x', ']', or separate the string by certain characters among them. Neither of these two partition methods works well for programs, and they also create many problems for modeling the program code and understanding the program intention.

In addition, in most cases, program operators are deeply buried in the variables. For instance, the expression "seq[pivot_index]" actually represents an entry in a list, where "[]" is an operator. Without differentiating '[' and ']' from the remaining characters, it is highly likely that we would treat "seq[pivot_index]" merely as a new variable name and fail to process the code correctly. In this paper, to resolve such a problem, we propose to parse the program code lines into a program AST instead.

(Figure 2: program ASTs of two example statements, with Statement, Expression, Variable Token and Operator Token nodes; sibling nodes are connected by dashed sequential links.)

For instance, in Figure 2, we show two examples of program ASTs corresponding to two input program statements. The first statement involves the assignment of value "seq[pivot_index]" to variable "pivot_value". In its AST, we have "pivot_value", "seq" and "pivot_index" as the variable tokens, and "=" and "[]" as the operator tokens. Furthermore, "seq", "[]" and "pivot_index" together compose an expression in the syntax tree. For the nodes at the same level, i.e., the siblings, we add sequential links connecting them, which are denoted by the dashed links shown in Figure 2. The second example shown in Figure 2 is more complicated: it is a "FOR"-statement. According to the provided syntax tree shown in …
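To make the parsing step above concrete, the following minimal sketch builds such a tree for the statement "pivot_value = seq[pivot_index]" using Python's standard ast module; the specific parsing package used by the authors is only referenced by a footnote in this excerpt, so the ast module and the helper below are illustrative assumptions, not the paper's implementation.

import ast

def to_tree(node):
    # Recursively convert an ast node into a (label, children) pair;
    # variable names are attached to their Name nodes as concrete tokens.
    label = type(node).__name__                      # e.g., Assign, Subscript, Name
    if isinstance(node, ast.Name):
        label += ":" + node.id
    children = [to_tree(child) for child in ast.iter_child_nodes(node)]
    return (label, children)

statement = "pivot_value = seq[pivot_index]"
print(to_tree(ast.parse(statement)))
# Output (abridged; exact node names vary slightly across Python versions):
# ('Module', [('Assign', [('Name:pivot_value', ...),
#                         ('Subscript', [('Name:seq', ...), ('Name:pivot_index', ...), ...])])])

The child order returned by ast.iter_child_nodes already gives the left-to-right sibling order for examples like this, which is the sequential order carried by the dashed links described above.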
… package¹. With these tools, instead of modeling the raw program text, we can translate the program into its AST, and the following learning steps will all be based on the obtained ASTs by default.

As shown in the constructed ASTs, there exist two different relationship types among the nodes in the tree-structured diagram: the hierarchical relationship between father nodes and children nodes at different levels, and the sequential relationship between sibling nodes at the same level. To effectively model the contents of the nodes as well as the hierarchical-sequential relationships among the nodes, in this section we introduce a novel unit model, namely HSU (Hierarchical Sequential Unit). Hsu will be used as the basic structure for constructing the EgoCoder model (to be illustrated in the next subsection), and involves two sub-units, ESU (Evolutional Sequential Unit) and ISU (Inherited Sequential Unit), for handling the program generation and interpretation tasks respectively. The general structure of the HSU is provided in Figure 3, where the arrows denote the information flow directions, black/red dots represent the concatenation operations of vectors, and σ and tanh denote the sigmoid and hyperbolic tangent activation functions …

Figure 3: The Hierarchical Sequential Unit (HSU) Model. (The figure shows the ESU component on the left, with gates e_i^τ and r_i^τ among others, operating on the sibling state h_{i-1}^τ and the children states h_1^{τ+1}, …, h_n^{τ+1}, and the ISU component on the right, with gates t_i^τ and s_i^τ among others, operating on the sibling state and the father state.)

… the changes from the lower-level program expression to the higher-level program statement, which is effective for representing the changes in the scope of variables and other program context information:

  h̃^{τ+1} = e_i^τ ⊗ h^{τ+1}, where e_i^τ = σ(W_e [h_{i-1}^τ, h^{τ+1}]^⊤),

where W_e denotes the variable matrix in the "evolve gate" in ESU. ESU computes the output with the original inputs from the sibling and children nodes, i.e., h_{i-1}^τ and h^{τ+1}, as well as the updated sibling-node state vector h̃_{i-1}^τ and the evolved child-node state vector h̃^{τ+1}. ESU allows different combinations of the state vectors, which are controlled by two new selection gates z_i^τ and r_i^τ respectively. Formally, we can represent the final output of ESU as

  h_i^τ =   z_i^τ ⊗ r_i^τ ⊗ tanh(W_u [h̃_{i-1}^τ, h̃^{τ+1}]^⊤)
          ⊕ (1 ⊖ z_i^τ) ⊗ r_i^τ ⊗ tanh(W_u [h_{i-1}^τ, h̃^{τ+1}]^⊤)
          ⊕ z_i^τ ⊗ (1 ⊖ r_i^τ) ⊗ tanh(W_u [h̃_{i-1}^τ, h^{τ+1}]^⊤)
          ⊕ (1 ⊖ z_i^τ) ⊗ (1 ⊖ r_i^τ) ⊗ tanh(W_u [h_{i-1}^τ, h^{τ+1}]^⊤),

where z_i^τ = σ(W_z [h_{i-1}^τ, h^{τ+1}]^⊤), r_i^τ = σ(W_r [h_{i-1}^τ, h^{τ+1}]^⊤), and 1 denotes a vector filled with value 1. Matrices W_u, W_z, W_r represent the variables involved in these components. Vector h_i^τ will be the output to both the right sibling node and the father node in ESU.
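As a concrete reading of the ESU output equation above, the sketch below implements the four-way gated combination with NumPy under assumed shapes (state dimension d, parameter matrices of shape d x 2d); it is a minimal illustration of the formula only, not the authors' implementation, and the ISU output in the next subsection follows the same pattern with its own gates and parameters.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def esu_output(h_sib, h_child, h_sib_upd, h_child_evo, Wu, Wz, Wr):
    # h_sib:       h_{i-1}^tau, original left-sibling state, shape (d,)
    # h_child:     h^{tau+1},   original child-node state,   shape (d,)
    # h_sib_upd:   sibling state after the forget gate,      shape (d,)
    # h_child_evo: child state after the evolve gate,        shape (d,)
    # Wu, Wz, Wr:  parameter matrices, shape (d, 2d)
    concat = np.concatenate([h_sib, h_child])
    z = sigmoid(Wz @ concat)                 # selection gate z_i^tau
    r = sigmoid(Wr @ concat)                 # selection gate r_i^tau
    mix = lambda a, b: np.tanh(Wu @ np.concatenate([a, b]))
    return (z * r * mix(h_sib_upd, h_child_evo)
            + (1 - z) * r * mix(h_sib, h_child_evo)
            + z * (1 - r) * mix(h_sib_upd, h_child)
            + (1 - z) * (1 - r) * mix(h_sib, h_child))

# toy usage with state dimension d = 4
d = 4
rng = np.random.default_rng(0)
Wu, Wz, Wr = [rng.standard_normal((d, 2 * d)) for _ in range(3)]
states = [rng.standard_normal(d) for _ in range(4)]
h_out = esu_output(*states, Wu, Wz, Wr)      # output sent to the right sibling and the father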
3.2.2 Inherited Sequential Unit. The component on the right of Figure 3 is called the ISU, which accepts input from the left sibling node, i.e., h_{i-1}^τ, and the higher-level father node, i.e., h_j^{τ-1}, and generates the output for the right sibling node and the children nodes at the lower level. Similar to ESU, there also exists a "forget gate" in ISU for updating some information from the sibling state input. Slightly different from ESU, the "forget gate" in ISU is controlled by the … Formally, we can represent the final output of ISU as

  h_i^τ =   y_i^τ ⊗ s_i^τ ⊗ tanh(W_v [h̃_{i-1}^τ, h̃_j^{τ-1}]^⊤)
          ⊕ (1 ⊖ y_i^τ) ⊗ s_i^τ ⊗ tanh(W_v [h_{i-1}^τ, h̃_j^{τ-1}]^⊤)
          ⊕ y_i^τ ⊗ (1 ⊖ s_i^τ) ⊗ tanh(W_v [h̃_{i-1}^τ, h_j^{τ-1}]^⊤)
          ⊕ (1 ⊖ y_i^τ) ⊗ (1 ⊖ s_i^τ) ⊗ tanh(W_v [h_{i-1}^τ, h_j^{τ-1}]^⊤),

where gates y_i^τ = σ(W_y [h_{i-1}^τ, h^{τ-1}]^⊤), s_i^τ = σ(W_s [h_{i-1}^τ, h^{τ-1}]^⊤), and matrices W_y, W_s, W_v denote the variables of ISU. Vector h_i^τ will be the output to both the right sibling node as well as all the children nodes.
In sum, the ESU and ISU components covered in the HSU unit model have a lot in common, as they (1) both have a forget gate, (2) both have an evolve/inherit gate, and (3) both combine the original states and updated states to generate the output. There also exist differences between them: ISU accepts input from the father node to generate output to the children nodes, while ESU accepts input from the children nodes instead and generates output to the father node. The evolve/inherit gates in ESU and ISU effectively adapt the program context changes between different levels, but in different directions. In the training and testing stages of EgoCoder, ESU and ISU will mainly be used as the unit structures for program interpretation and program generation respectively, to be introduced as follows.

3.3 Framework Learning
With the HSU introduced before, we can represent the architecture of EgoCoder in Figure 4, which is also a tree-structured diagram. Based on the ASTs parsed from the input program source code, a set of sub-trees will be sampled for training EgoCoder. The n children HSU nodes at the lower level are fed with their raw encoding features and sibling node states as the inputs. Here, n = d_max denotes the maximum node degree in the program AST, and dummy padding will be used for the sub-trees with fewer than n children nodes. Among these n children nodes, the data flow is bi-directed, which can effectively model the sequential patterns in ASTs in both directions. Furthermore, the outputs of the children HSU nodes will all be fed to a father node at the higher level, which accepts no sibling node input. The output of the father HSU node will effectively recover its content. Besides the bottom-up mode, EgoCoder can also work well in a top-down mode, where the input of the father HSU node will generate the contents of the children HSU nodes. In this part, we will introduce the EgoCoder model in great detail to illustrate how to train the model with program ASTs.

(Figure 4: the EgoCoder architecture, a tree of HSU nodes whose children nodes take the raw features x_1, x_2, · · · , x_n as input.)

3.3.1 Token Raw Encoding. As introduced before, in the program ASTs the nodes denote the program components, which contain program syntax types, token contents and program intentions (optional). Based on the parsing results obtained from the program, we can obtain the syntax type set and the set of concrete keyword, variable, operator and other tokens used in the program, which are represented as sets C and T respectively. Formally, for each node v_i ∈ V in the program ASTs, its representation can be denoted as a vector x_i = [x_i^c, x_i^t] ∈ {0, 1}^{|C| + k·|T|}, where x_i^c ∈ {0, 1}^{|C|} and x_i^t ∈ {0, 1}^{k·|T|} represent the one-hot feature vectors of syntax types and tokens respectively, and k represents the maximal number of tokens contained in the tree nodes. For the tree …

3.3.2 Program Generation: Top-Down Training of EgoCoder. Based on the input raw feature vector y from the father node in EgoCoder, as illustrated in Figure 4, we can denote the output result of the father node via the ISU model as

  h^τ = ISU(y, null; W_I),

where null denotes a dummy padding vector and W_I covers all the variables involved in the ISU model introduced before. By feeding h^τ as the input to the children nodes at the lower level, model EgoCoder will generate the output representations of the children nodes. We can denote the state and output vectors of the i-th child node as

  h_i^{τ+1} = ISU(h^τ, h_{i-1}^{τ+1}; W_I),
  x̂_i = softmax(W_down h_i^{τ+1} + b_down),

where h_{i-1}^{τ+1} denotes the input from the left sibling node, softmax(·) represents the softmax function, and h_0^{τ+1} = null for the first child node without a left sibling. W_down and b_down are the variables involved to project the node state to the output. Compared with the ground-truth representations of the children nodes in the sampled sub-tree, i.e., {x_i}_{i=1}^{d_max}, the loss introduced by the ISU model on the sub-tree can be represented as

  L_isu = Σ_{i=1}^{d_max} Σ_{j=1}^{|C|+k·|T|} −x_i[j] log x̂_i[j],    (1)

which is defined based on the cross-entropy loss function; index j enumerates all the syntax types and tokens involved in the node representation vector.
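To make the top-down training step concrete, the sketch below chains the pieces just described for one sampled sub-tree: one-hot raw encodings x_i = [x_i^c, x_i^t], an ISU step (reduced here to a generic gated map, since the full unit is given above), a softmax projection to the child predictions x̂_i, and the cross-entropy loss of Eq. (1). The toy dimensions and the simplified isu_step function are assumptions for illustration only, not the authors' code.

import numpy as np

n_types, n_tokens, k, d = 6, 20, 3, 8         # |C|, |T|, tokens per node, state size
feat_dim = n_types + k * n_tokens             # length of x_i = [x_i^c, x_i^t]
rng = np.random.default_rng(1)

def one_hot(length, idx):
    v = np.zeros(length)
    v[idx] = 1.0
    return v

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def isu_step(h_father, h_left_sib, W):
    # simplified stand-in for ISU: a map (father state, left-sibling state) -> child state
    return np.tanh(W @ np.concatenate([h_father, h_left_sib]))

W_in = rng.standard_normal((d, feat_dim))     # projects the father raw feature y to a state
W_isu = rng.standard_normal((d, 2 * d))
W_down, b_down = rng.standard_normal((feat_dim, d)), np.zeros(feat_dim)

y = np.concatenate([one_hot(n_types, 0), one_hot(k * n_tokens, 1)])   # father raw feature
h_father = np.tanh(W_in @ y)                  # stand-in for h^tau = ISU(y, null; W_I)

# ground-truth children features of the sampled sub-tree, generated left to right
children = [np.concatenate([one_hot(n_types, 2), one_hot(k * n_tokens, 5)]),
            np.concatenate([one_hot(n_types, 3), one_hot(k * n_tokens, 9)])]
loss, h_prev = 0.0, np.zeros(d)               # h_0^{tau+1} = null for the first child
for x_true in children:
    h_child = isu_step(h_father, h_prev, W_isu)
    x_hat = softmax(W_down @ h_child + b_down)
    loss += -(x_true * np.log(x_hat + 1e-12)).sum()    # one term of L_isu in Eq. (1)
    h_prev = h_child
print("L_isu on this sub-tree:", loss)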
3.3.3 Program Interpretation: Bottom-Up Training of EgoCoder. On the other hand, besides the top-down direction, we also train the EgoCoder model in a bottom-up manner. Based on the input for the children nodes, we can generate the content of the father node as well. Formally, we can represent the input vectors for the children nodes as {x_i}_{i=1}^{d_max}. By feeding these vectors to the children nodes, we can represent the output vector from the i-th child node as vector h_i^{τ+1}:

  h_i^{τ+1} = ESU(x_i, h_{i-1}^{τ+1}; W_E),

where h_0^{τ+1} = null for the first child node, and W_E represents the variables involved in the ESU model. Furthermore, based on the children node representations, we can represent the state vector and output vector of the father node as

  h^τ = ESU(h^{τ+1}, null; W_E),
  ŷ = softmax(W_up h^τ + b_up),

where vector h^{τ+1} = [h_1^{τ+1}, h_2^{τ+1}, · · · , h_n^{τ+1}]^⊤ contains all the children node states, and W_up and b_up are the variables used to project the father node state to the output. The loss introduced by the ESU model can be represented as

  L_esu = Σ_{j=1}^{|C|+k·|T|} −y[j] log ŷ[j].    (2)

3.3.4 Program Completion: Sequential Training of EgoCoder. In the case when only a fragment of the program is provided for feeding the child nodes, the training process for the ESU will encounter great challenges, since the incomplete input will mislead the learning. … The loss introduced in generating the children node tokens can be represented as:

  L_hsu = Σ_{i=2}^{d_max} Σ_{j=1}^{|C|+k·|T|} −x_i[j] log x̂_{i-1,r}[j]    (3)
        + Σ_{i=1}^{d_max−1} Σ_{j=1}^{|C|+k·|T|} −x_i[j] log x̂_{i+1,l}[j].    (4)

3.3.5 Joint Optimization Objective Function. Based on the above descriptions, we can represent the joint optimization function of model EgoCoder as

  min_{W_I, W_E, W_P}  L_isu + α · L_esu + β · L_hsu,    (5)

where α and β denote the weights of the last two loss terms (in the experiments, α and β are both assigned value 1.0).

Formally, to solve the above objective function, the learning process of EgoCoder is carried out based on the sub-tree structures (involving one parent node and all its child nodes) sampled from the program AST. To optimize the above loss function, we utilize Stochastic Gradient Descent (SGD) as the optimization algorithm. To be more specific, the training process involves multiple epochs. In each epoch, the training data is shuffled and a minibatch of the instances is sampled to update the parameters with SGD. In addition, for each sampled sub-tree, we feed the EgoCoder model to minimize the loss terms L_isu, L_esu and L_hsu iteratively …
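The training procedure described above can be summarized by the following skeleton: sample sub-trees from the parsed ASTs, evaluate the three loss terms, and take SGD steps on the joint objective of Eq. (5) with α = β = 1.0. The loss functions, the gradient routine and the parameter container are assumed placeholder callables, since the excerpt does not include the authors' implementation; the toy run at the bottom only demonstrates the control flow.

import numpy as np
import random

ALPHA, BETA, LR, EPOCHS, BATCH_SIZE = 1.0, 1.0, 0.01, 3, 16

def train(subtrees, params, loss_isu, loss_esu, loss_hsu, grad):
    # SGD over sub-trees sampled from the program ASTs (objective of Eq. (5)).
    for epoch in range(EPOCHS):
        random.shuffle(subtrees)                              # reshuffle every epoch
        for start in range(0, len(subtrees), BATCH_SIZE):
            batch = subtrees[start:start + BATCH_SIZE]
            total = sum(loss_isu(t, params)
                        + ALPHA * loss_esu(t, params)
                        + BETA * loss_hsu(t, params) for t in batch)
            g = grad(batch, params)                           # gradients of the joint loss
            for name in params:                               # vanilla SGD update
                params[name] -= LR * g[name]
        print(f"epoch {epoch}: last batch loss {total:.4f}")

# toy run: one scalar parameter and quadratic stand-in losses, just to exercise the loop
params = {"w": np.array([1.0])}
quad = lambda tree, p: float((p["w"] ** 2).sum())
grad = lambda batch, p: {"w": 2 * (1 + ALPHA + BETA) * len(batch) * p["w"]}
train(list(range(64)), params, quad, quad, quad, grad)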
4 EXPERIMENTS
To test the effectiveness of the proposed unit model Hsu and the learning framework EgoCoder, we have conducted extensive experiments … In the following part of this section, we will first introduce the experimental settings, including dataset descriptions, detailed experiment setup, comparison methods and evaluation metrics. After that, the experimental results and case studies will be provided and analyzed.

Table 2: Program Generation from Comments.
methods          Accuracy
EgoCoder         35/43
Ast-BiRnn-LSTM   13/43
Ast-BiRnn-GRU    13/43
Ast-BiRnn-Basic  11/43
Ast-AutoEncoder  0/43
BiRnn-LSTM       2/43
BiRnn-GRU        1/43
BiRnn-Basic      2/43

Table 3: Program Completion from a Random Input Line.
methods          Accuracy
EgoCoder         34/43
Ast-BiRnn-LSTM   12/43
Ast-BiRnn-GRU    12/43
Ast-BiRnn-Basic  10/43
Ast-AutoEncoder  0/43
BiRnn-LSTM       0/43
BiRnn-GRU        0/43
BiRnn-Basic      0/43

… the maximum children node number as d_max, and the maximum number of tokens attached to each node as k. For nodes with fewer than d_max children or k tokens, a dummy one-hot feature vector representation is used for padding. At the same time, for the AST root node, we extract the textual comments of the whole program as its content, which is actually represented as a sequence of natural language tokens and can be modeled with the Hsu model effectively as well. Based on the sampled sub-trees, we compare EgoCoder with various existing prediction and generative models, which are listed as follows:
• BiRnn-LSTM: … [28] models based on the raw program textual information. Model BiRnn-LSTM splits the program code and comment into tokens based on the spaces among them.
• BiRnn-GRU: Model BiRnn-GRU, using GRU [9] as the unit model, also splits the program code and comment into tokens based on the spaces among them and infers the following tokens iteratively.
• BiRnn-Basic: Model BiRnn-Basic is of the same architecture …

… generation (from comments to program code), program code interpretation (from program code to program comments), and program …
Table 4: Program Interpretation with Input Source Code. (Mi-, Ma- and W- denote micro-, macro- and weighted-averaged metrics; Train-Iteration is the number of training iterations needed to converge.)
methods Accuracy Mi-Precision Mi-Recall Mi-F1 Ma-Precision Ma-Recall Ma-F1 W-Precision W-Recall W-F1 Train-Iteration
EgoCoder 0.978±0.072 0.978±0.072 0.978±0.072 0.978±0.072 0.963±0.107 0.963±0.105 0.963±0.107 0.978±0.072 0.978±0.072 0.977±0.073 43,000
Ast-BiRnn-LSTM 0.84±0.198 0.84±0.198 0.84±0.198 0.84±0.198 0.779±0.248 0.793±0.241 0.78±0.248 0.832±0.21 0.84±0.198 0.829±0.208 86,000
Ast-BiRnn-GRU 0.849±0.188 0.849±0.188 0.849±0.188 0.849±0.188 0.787±0.241 0.802±0.232 0.789±0.24 0.84±0.202 0.849±0.188 0.837±0.198 86,000
Ast-BiRnn-Basic 0.798±0.212 0.798±0.212 0.798±0.212 0.798±0.212 0.728±0.254 0.731±0.254 0.724±0.257 0.805±0.213 0.798±0.212 0.793±0.215 86,000
Ast-AutoEncoder 0.321±0.182 0.321±0.182 0.321±0.182 0.321±0.182 0.174±0.144 0.214±0.151 0.183±0.143 0.279±0.192 0.321±0.182 0.285±0.182 4,300,000
BiRnn-LSTM 0.782±0.069 0.782±0.069 0.782±0.069 0.782±0.069 0.769±0.121 0.769±0.121 0.769±0.121 0.782±0.069 0.782±0.069 0.782±0.069 430,000
BiRnn-GRU 0.782±0.069 0.782±0.069 0.782±0.069 0.782±0.069 0.769±0.121 0.769±0.121 0.769±0.121 0.782±0.069 0.782±0.069 0.782±0.069 430,000
BiRnn-Basic 0.751±0.135 0.751±0.135 0.751±0.135 0.751±0.135 0.731±0.183 0.731±0.183 0.731±0.183 0.751±0.135 0.751±0.135 0.751±0.135 430,000
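In Table 4, Mi-, Ma- and W- refer to micro-, macro- and weighted-averaged precision, recall and F1 over the predicted tokens. The paper does not state how these were computed; the sketch below shows one standard way to obtain such averages with scikit-learn, treating the ground-truth and predicted token indices of a program as two label sequences (the toy indices are made up for illustration).

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# toy example: ground-truth vs. predicted token indices for one generated comment
y_true = [3, 7, 7, 1, 4, 4, 2]
y_pred = [3, 7, 5, 1, 4, 4, 2]

print("Accuracy:", accuracy_score(y_true, y_pred))
for avg in ("micro", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                  average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")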
Figure 5: Succeeded Examples of EgoCoder. Left: Binary Search; Right: Quick Sort. (Each panel, (a) Binary Search Code and (b) Quick Sort Code, pairs lines of the original program with the corresponding token sequences generated by EgoCoder, e.g., "def partition(seq, left, right, pivot_index):" and "def partition ( seq , left , right , pivot_index ) :".)
… comments. Here, the generation process involves the iterative inference of the program tokens at the next lines based on the input program comments, without any interaction with the outside world. Among the 43 input program comments, EgoCoder is able to generate 35 of them without making any mistakes, which outperforms the baseline methods by a large margin. The AST-BiRNN methods can generate 11-13 of the programs without any mistakes. The traditional BiRNN methods can only generate 1-2 programs, while Ast-AutoEncoder cannot generate any program at all.

4.2.2 Program Interpretation based on Code. In Table 4, we provide the experimental results of the comparison methods in generating the program comments based on the program code input. Compared with the program code, the program comments are of a shorter length and have fewer tokens to be predicted, and the evaluation scores achieved by the baseline methods in Table 4 are also slightly larger than the scores in Table 1. … Accuracy achieved by Ast-AutoEncoder, and also surpasses Ast-BiRnn-LSTM, Ast-BiRnn-GRU and Ast-BiRnn-Basic by more than 10%, and outperforms BiRnn-LSTM, BiRnn-GRU and BiRnn-Basic by more than 22.6%. Similar results can also be observed for the other evaluation metrics. In addition, model EgoCoder takes far fewer iterations before convergence on the training set. As shown in Table 1, the number of iterations required for EgoCoder to converge … Compared with the baseline methods, EgoCoder can still achieve much better results, with clear advantages. Among all the program code inputs, EgoCoder can correctly generate about 0.978 of the program comment tokens. The AST-BiRNN methods can also achieve a very good performance, obtaining an average Accuracy around 0.8, which is better than the traditional BiRNN methods. Method Ast-AutoEncoder performs …

… pick one line in the program as the input for the models to complete the program code (i.e., generate the code ahead of or after the input line). Slightly different from the program generation shown in Table 2, where the input is the tree root node content, the input in program completion can be any line of the program code, and the results obtained are slightly lower than those in Table 2. Among the 43 programs in the dataset, EgoCoder completes 34 of them correctly, which is much better than the other baseline methods. The AST-BiRNN methods can still complete about 10 of the programs correctly, while the remaining methods cannot complete the program code at all. The program completion task may require the model to generate contents in both the sequential directions and the hierarchical directions, i.e., ahead of the input, after the input, above the input and below the input, which demonstrates the advantages of the Hsu model compared against the traditional RNN models.
Figure 6: Examples that EgoCoder Fails to Handle. Left: Sieve of Atkin Algorithm; Right: Union Find Class. (The duplicated contents are highlighted in colored bolded font. The panels show the Python source code of the two programs, labeled (a) Sieve of Atkin Code and (b) Union Find Class Code.)
4.2.4 Experimental Discoveries. Generally, according to the program generation, program interpretation and program completion experimental results, EgoCoder performs very well in inferring the contents at both the children nodes and the father node. The main reason can be that the Hsu model effectively captures both the sequential and hierarchical information patterns in the ASTs, which performs much more effectively than models merely capturing the sequential patterns. In addition, the AST also greatly improves the model performance, since the tree diagram effectively helps the models outline the program's hierarchical structure. The Ast-AutoEncoder model cannot achieve a good performance in these content generation tasks.

4.3 Case Study
In this part, we provide a study of the succeeded and failed cases of EgoCoder in the experiments.

…

4.3.2 Failed Cases. Besides the succeeded cases aforementioned, in Figure 6 we also provide two cases that EgoCoder cannot handle well, especially in program code generation and completion. The left program code is about the Sieve of Atkin algorithm, and the right code is about the Union Find class with path compression. The main problem with the code is that it contains many duplicated contents. In the blocks, we can identify several common statements (in the same colors), which make EgoCoder fail to work. For instance, in the Union Find class code, given the statement "self.__validate_ele(y)" (in bolded blue font), it is very hard for the model to generate the statement after it, since there are two different options, "x_root = self.__find(x)" and "return self.find(x) == self.find(y)". Such a problem can hopefully be addressed by incorporating an even deeper architecture into EgoCoder. Just like this failed case, if we can effectively incorporate the father node …
Program Synthesis: The problem studied in this paper is also closely related to the program synthesis problem studied in software engineering. Formally, the goal of software program synthesis is to generate programs automatically from high-level specifications, and many research works have been done on this topic already. Program synthesis is a challenging problem, which may require external supervision from templates [30], examples and type information [26], or oracles [18]. In [18], the authors present a novel approach to the automatic synthesis of loop-free programs based on a combination of oracle-guided learning from examples. Based on program templates, [30] introduces an approach to generate programs from the templates. Osera et al. [26] introduce an algorithm for synthesizing recursive functions that process algebraic datatypes, which exploits both type information and input-output examples to prune the search space. Some other program synthesis works address the problem with recursive algorithms [5], deductive … program generation, which covers the tasks about program code …

… for speech recognition and related applications: An overview. In ICASSP, 2013.
[12] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[13] S. Hill. Elite and upper-class families. In Families: A Social Class Perspective. 2012.
[14] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.
[15] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 2006.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 1997.
[17] H. Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. Technical report, Fraunhofer Institute for Autonomous Intelligent Systems (AIS), 2002.
[18] S. Jha, S. Gulwani, S. Seshia, and A. Tiwari. Oracle-guided component-based program synthesis. In ICSE, 2010.
[19] I. Konstas and M. Lapata. A global model for concept-to-text generation. J. Artif. Int. Res., 2013.
[20] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[21] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg. Baby talk: …
[25] A. Mnih and G. Hinton. A scalable hierarchical distributed language model. In …