
EgoCoder: Intelligent Program Synthesis with Hierarchical Sequential Neural Network Model

Jiawei Zhang1, Limeng Cui2, Fisher B. Gouza1
1 IFM Lab, Department of Computer Science, Florida State University, FL, USA
2 School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
[email protected], [email protected], [email protected]

arXiv:1805.08747v1 [cs.AI] 22 May 2018
ABSTRACT
Programming has been an important skill for researchers and practitioners in computer science and related areas. To learn basic programming skills, beginners usually require long-term systematic training. According to recent market reports, the computer software market is expected to keep expanding at an accelerating speed, but the market supply of qualified software developers can hardly meet such a huge demand. In recent years, the surge of text generation research provides an opportunity to address this dilemma through automatic program synthesis. In this paper, we propose to solve the program synthesis problem from a data mining perspective, and we introduce a novel generative model, namely EgoCoder. EgoCoder parses program code into abstract syntax trees (ASTs), where the tree nodes contain the program code/comment content and the tree structure captures the program logic flows. Based on a new unit model called Hsu, EgoCoder can effectively capture both the hierarchical and sequential patterns in the program ASTs. Extensive experiments comparing EgoCoder with state-of-the-art text generation methods demonstrate the effectiveness of EgoCoder in addressing the program synthesis problem.

KEYWORDS
Program Synthesis; Text Generation; Neural Networks; Data Mining

ACM Reference Format:
Jiawei Zhang, Limeng Cui, Fisher B. Gouza. 2018. EgoCoder: Intelligent Program Synthesis with Hierarchical Sequential Neural Network Model. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Formally, programming denotes the process of developing and implementing computer instructions to enable a computer to perform certain tasks. These instructions are usually written in one or several programming languages, and a sequence of computer instructions (implementing pre-specified functions) is called a computer program, which helps the computer operate smoothly. To learn the necessary programming skills, long-term systematic training is usually required for beginners. Generally, to be a qualified programmer, people may need to master knowledge from various areas, including programming languages, discrete mathematics, data structures and algorithms, etc.

Computer programming continues to be a necessary and important skill for both academic researchers and industry practitioners as Internet and AI applications continue to expand. As introduced in [2], the computer software market is expanding at an accelerating speed and is estimated to grow from 19.98 billion USD in 2014 to more than 50.34 billion USD in 2022. Meanwhile, according to the latest market analysis report [3], there exists a huge gap between the market supply and demand of software developers. For instance, from January 2016 to February 2017, more than 115,000 job postings requesting qualified software engineers were published each month, but the average monthly hire number was merely 33,579. Such a huge demand-supply gap also motivates many large IT companies to seek other ways to address the problem.

For effective program code storage and maintenance, the well-known big IT and related technology companies maintain company-internal codebases for storing all the developed program code of company systems, web services, software products and research projects. The program code in these codebases is normally of a tremendous amount. A recent report [1] lists the lines of code used in several companies and software systems, among which Google ranks at the top with more than 2 billion lines of code [4] used across all its Internet services. These company codebase repositories cover very diverse yet high-quality code, which is also among the most valuable intellectual property of these companies, but it has not been effectively exploited.

Programming has long been treated, by untrained eyes, as one of the most challenging skills, mastered by only a small number of people. In this paper, we make an attempt at this holy grail of software engineering by training a model to write programs automatically. The automatic program synthesis problem is a fundamental problem from the technology, business and societal development perspectives. Successfully addressing the problem would effectively bridge the market supply-demand gap for qualified practitioners, greatly stimulate the development of IT and other related areas, intelligently recycle company-internal codebases for secondary development, and promisingly free humans from tedious coding positions for more challenging jobs.

In recent years, due to the surge of deep learning developments [12], many text generation research works and models have been proposed, which introduce many novel yet interesting research problems.
Meanwhile, slightly different from the unstructured sentences written in natural languages, program code written in programming languages is highly structured and can be precisely parsed into a hierarchical structure according to the grammar of the specific programming language. For instance, a program written in an advanced programming language such as Python consists of hierarchical structures like classes, functions, statements and expressions. Therefore, instead of handling the program character by character (like the existing text generation research works [31]), new techniques that can handle the program according to its own structure will be necessary.

Figure 1: An Example of Program Abstract Syntax Tree (the quick sort program source code on the left and its program hierarchy of module, class, function and statement nodes on the right).
The automatic program synthesis problem is extremely challenging to solve due to several reasons:

• Lack of Problem Definition: The automatic program synthesis problem is still an open problem in this context so far. A formal definition of automatic program synthesis is required before proposing potential solutions to address it.
• Program Hierarchical Structure Extraction: There usually exists a concrete hierarchical-sequential structure in program code according to its logic flows, hierarchically and sequentially. Generally, code tokens at the lower levels of a program precisely implement the desired physical functions of the program components at higher levels; meanwhile, at each level, the logic flows in a sequential manner from the beginning to the end. Extraction of such a hierarchical-sequential program structure will be useful for effective program information modeling and representation learning.
• Unit Model: Each component in the hierarchical-sequential structure aforementioned, depending on the specific running mode, accepts input from the components above/below and before/after it. A new unit model for implementing such intertwined relationships in the learning process is desired.
• Program Intention Incorporation: Besides the program code itself, there usually exist some textual descriptions of the program code in a natural language, which indicate the physical function of the program, e.g., ranking, shuffling, searching, factorization and dynamic programming, etc. Effectively incorporating the program intention into the learning process allows both program generation and interpretation across natural languages and programming languages.

To effectively resolve the above challenges, in this paper, we will introduce a novel neural network model, namely EgoCoder, with a deep architecture. EgoCoder provides a formal definition of the automatic program synthesis problem, which covers three different sub-problems: program generation, program interpretation and program completion. Instead of learning the models based on the pure text information in the program code, EgoCoder extracts the hierarchical-sequential structure with a programming language parser, which translates the input program code into abstract syntax tree (AST) structured diagrams. Each node in the extracted ASTs contains both syntax types and tokens as its content. Meanwhile, the structure of the extracted ASTs also effectively indicates the semantic logic flows of the program. To capture both the syntax contents of the program components and the semantic logic flow of the program, a new unit model, namely Hsu (hierarchical sequential unit), will be used as the basic component in EgoCoder. The Hsu unit model can accept inputs from sibling nodes at the same level, as well as evolve information from the child nodes and inherit information from the father node simultaneously. Based on a set of sub-tree batches sampled from the extracted program ASTs, EgoCoder can be trained effectively to capture the substructures covered in the ASTs. These new technical terms will be clearly illustrated in great detail in this paper.

2 PROBLEM FORMULATION
In this section, we will first define several important concepts used in this paper, based on which we will provide the formulation of the studied problem and its three different running modes.

2.1 Terminology Definition
A computer program usually has a highly structured hierarchy, involving code components belonging to different syntax types.

Definition 2.1. (Program Syntax Type): Formally, we can represent the set of syntax types involved in the program as set C = {module, class, function, statement, expression} ∪ {unit token syntax type}, where the unit token syntax type set involves the various variable and operator types used in the program.

Based on the program syntax type set, we can translate a program into a program abstract syntax tree, where the nodes denote program code components (i.e., code blocks) belonging to different syntax types, and the links represent the semantic logic flows among the code components.

Definition 2.2. (Program AST): Formally, a program AST can be represented as a graph structured diagram T = (V, E, root), where V denotes the set of program component nodes, and E denotes the set of logic-flow relationships among the nodes at either different hierarchical levels or at the same hierarchical level. In T, root ∈ V represents the top program component node, which usually denotes the program module component by default.

Definition 2.3. (Program Component Node): Each program component node v ∈ V in the program AST can be denoted as a triple v = (c, t, f), where c ∈ C denotes its syntax type, t represents its textual content and f denotes the functional intention of the program component. The overall program intention can be represented as the AST root node intention by default.

For instance, as shown in Figure 1, given the input program on the left, we can represent its corresponding program AST on the right, where the top program component is a module. The program module covers one import statement and one class component, which further involves two function components. The first function component contains multiple statements with sequential relationships, and each statement further contains multiple sequential expressions, i.e., sequences of tokens. Different from natural language, program code is well structured, and each token also has a corresponding concept denoting its type, e.g., keywords vs. variables vs. operators, which can be precisely extracted with the corresponding programming language interpreter/parser.

Figure 2: An Example of Program Abstract Syntax Tree (syntax trees for two input statements: an assignment statement and a "FOR" statement).
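To make Definitions 2.2 and 2.3 concrete, the sketch below shows one possible in-memory representation of a program component node and its hierarchical/sequential links; the class names and fields are illustrative assumptions rather than details reported in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Syntax type set C from Definition 2.1 (unit token types abbreviated).
SYNTAX_TYPES = {"module", "class", "function", "statement", "expression",
                "variable_token", "operator_token", "reserved_token"}

@dataclass
class ProgramNode:
    """A program component node v = (c, t, f) from Definition 2.3."""
    syntax_type: str                    # c: element of the syntax type set C
    tokens: List[str]                   # t: textual content (unit tokens)
    intention: Optional[str] = None     # f: functional intention (e.g., a comment)
    children: List["ProgramNode"] = field(default_factory=list)   # hierarchical links
    next_sibling: Optional["ProgramNode"] = None                  # sequential link

# A tiny AST fragment for the statement "pivot_value = seq[pivot_index]".
expr = ProgramNode("expression", ["seq", "[]", "pivot_index"])
stmt = ProgramNode("statement", ["pivot_value", "=", "seq", "[]", "pivot_index"],
                   children=[ProgramNode("variable_token", ["pivot_value"]),
                             ProgramNode("operator_token", ["="]),
                             expr])
```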
2.2 Problem Formulation
The automatic program synthesis problem studied in this paper actually covers three sub-problems simultaneously, each of which describes a special case of the "program synthesis" problem. Formally, these three sub-problems are illustrated as follows:

• Program Generation: Given the program intention of the top program module component in the program AST, the program generation problem aims at generating the program source code that can implement the specified intention.
• Program Interpretation: With the complete program source code or merely a fragment, the program interpretation problem aims at inferring the potential intention of the program, i.e., interpreting the physical functions of the program code.
• Program Completion: Given a fragment of the program code, which can be either a function or merely several statements of the code, the program completion problem aims at completing the missing components of the program.

3 PROPOSED METHODS
In this section, we will introduce the EgoCoder framework to solve the automatic program synthesis problem (including all three aforementioned sub-problems). The EgoCoder framework involves several crucial steps: (1) program parsing, (2) hierarchical sequential statement encoding with Hsu, and (3) framework learning. In the following part of this section, we will introduce these three steps in great detail.

3.1 Program Parsing
Different from natural languages, programs written in programming languages are highly structured. Instead of handling the code character by character, we propose to translate the program code into program ASTs, which will be taken as the input for the modeling introduced in the next subsection. For instance, given a program statement "pivot_value = seq[pivot_index]", it assigns an entry (with index "pivot_index") from list "seq" to a variable "pivot_value", where "=" and "[]" are the operators, and "pivot_value", "seq", "pivot_index" denote the assignment target, source list, and index variables respectively. For many programming languages, like Python, the spacing among the tokens has no impact on the program functions. For instance, the program statement "pivot_value=seq[pivot_index]" (with no space between the tokens) works exactly like "pivot_value = seq [ pivot_index ]" (with tokens well separated by spaces). However, such a characteristic creates lots of challenges for partitioning a program line into unit tokens. Traditional text mining and natural language processing techniques will either partition the code line into a sequence of characters, i.e., 'p', 'i', 'v', 'o', ..., 'x', ']', or separate the string by certain characters among them. Neither of these two partition methods works well for programs, and they will also create lots of problems for modeling the program code and understanding the program intention.

In addition, in most cases, program operators are deeply buried in the variables. For instance, the expression "seq[pivot_index]" actually represents an entry in a list, where "[]" is an operator. Without differentiating '[' and ']' from the remaining characters, it is highly likely that we will treat "seq[pivot_index]" merely as a new variable name and fail to process the code correctly. In this paper, to resolve such a problem, we propose to parse the program code lines into a program AST instead.

For instance, in Figure 2, we show two examples of program ASTs corresponding to two input program statements. The first statement involves the assignment of the value "seq[pivot_index]" to the variable "pivot_value". In its AST, we have "pivot_value", "seq" and "pivot_index" as the variable tokens, and "=" and "[]" as the operator tokens. Furthermore, "seq", "[]" and "pivot_index" together compose an expression in the syntax tree. For the nodes at the same level, i.e., the siblings, we add sequential links connecting them, which are denoted by the dashed links shown in Figure 2.

The second example shown in Figure 2 is more complicated: it is a "FOR" statement. According to the syntax tree shown in the figure, this statement contains the "FOR-Condition" statement and the "FOR-Body" statement as the child nodes of the root. The "FOR-Condition" statement starts with a reserved keyword token "for", followed by a variable token and another reserved keyword token "in" respectively, and ends with an expression "range(left, right)" (involving the function call token "range()" as well as the variable tokens "left" and "right"). Furthermore, the "FOR-Body" statement contains an "IF" statement, involving both the "IF-Condition" statement and the "IF-Body" statement respectively.

For long programs, their ASTs will have an extremely deep structure, which may cause many computational problems in model learning. In this paper, we allow EgoCoder to truncate ASTs to shrink the tree depth. For instance, if we use the statement as the smallest basic syntax type in the AST leaf nodes, then the ASTs of program statements 1 and 2 in Figure 2 will have a much simpler structure, whose involved nodes are marked in green circles in Figure 2. There exist some open-source tools which can generate the syntax tree of Python code automatically, e.g., the Python ast package1. With these tools, instead of modeling the raw program text, we can translate the program into its AST, and the following learning steps will all be based on the obtained ASTs by default.

1 https://docs.python.org/2/library/ast.html
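As a brief illustration of this parsing step, the snippet below uses Python's standard ast module (referenced above) to parse the example statement and inspect its node types; it is only a minimal sketch of the idea, since the paper does not specify the node/token extraction at this level of detail.

```python
import ast

# Parse the example statement from Figure 2 into Python's built-in AST.
tree = ast.parse("pivot_value = seq[pivot_index]")

# The module node contains one Assign statement; walking the tree exposes
# the assignment target, the subscript operator and the index variable.
for node in ast.walk(tree):
    print(type(node).__name__)
# Module, Assign, Name (pivot_value), Subscript (seq[...]), Name (seq),
# Name (pivot_index), plus context nodes such as Store/Load.

# ast.dump gives the full nested structure that an EgoCoder-style parser
# would flatten into (syntax type, token) node triples.
print(ast.dump(tree))
```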
3.2 Hierarchical Sequential Unit (HSU)
As shown in the constructed ASTs, among the nodes in the tree structured diagram there exist two different relationship types: the hierarchical relationship between father nodes and children nodes at different levels, and the sequential relationship between sibling nodes at the same level. To effectively model the contents of the nodes as well as the hierarchical-sequential relationships among the nodes, in this section we introduce a novel unit model, namely HSU (Hierarchical Sequential Unit). Hsu will be used as the basic structure for constructing the EgoCoder model (to be illustrated in the next subsection), and it involves two sub-units, ESU (Evolutional Sequential Unit) and ISU (Inherited Sequential Unit), for handling the program interpretation and program generation tasks respectively. The general structure of the HSU is provided in Figure 3, where the arrows denote the information flow directions, black/red dots represent the concatenation operations of vectors, σ and tanh denote the sigmoid and hyperbolic tangent functions respectively, and the icons ⊗, ⊕ represent the entry-wise vector product and sum operators.

Figure 3: The Hierarchical Sequential Unit (HSU) Model.

3.2.1 Evolutional Sequential Unit. In Figure 3, the component on the left is an ESU, which accepts the input from the children nodes, i.e., $\mathbf{h}_1^{\tau+1}, \mathbf{h}_2^{\tau+1}, \cdots, \mathbf{h}_n^{\tau+1}$, and from the left sibling node, i.e., $\mathbf{h}_{i-1}^{\tau}$. For the input from the sibling node, ESU adopts a "forget gate", which may choose one part of $\mathbf{h}_{i-1}^{\tau}$ to update. In programs, the scope of variables can be different between statements, and it may be updated as the code runs into a new statement. Formally, we can represent the "forget gate" together with the updated left-sibling node state as
$$\tilde{\mathbf{h}}_{i-1}^{\tau} = \mathbf{f}_i^{\tau} \otimes \mathbf{h}_{i-1}^{\tau}, \ \text{where} \ \mathbf{f}_i^{\tau} = \sigma\left(\mathbf{W}_f \left[\mathbf{h}_{i-1}^{\tau}, \mathbf{h}^{\tau+1}\right]^{\top}\right).$$
Here, $\mathbf{h}^{\tau+1} = \left[\mathbf{h}_1^{\tau+1}, \mathbf{h}_2^{\tau+1}, \cdots, \mathbf{h}_n^{\tau+1}\right]$ denotes the concatenated input state vector from the children nodes, and matrix $\mathbf{W}_f$ represents the variables of the "forget gate" in ESU.

Meanwhile, for the inputs from the children nodes, ESU introduces a gate, namely the "evolve gate", which can evolve the children input states to the upper level. Here, the term "evolve" models the changes from the lower-level program expressions to the higher-level program statement, which is effective to represent the changes in the scope of variables and other program context information across levels in program ASTs. Formally, we can represent the "evolve gate" as well as the updated children node state vector as
$$\tilde{\mathbf{h}}^{\tau+1} = \mathbf{e}_i^{\tau} \otimes \mathbf{h}^{\tau+1}, \ \text{where} \ \mathbf{e}_i^{\tau} = \sigma\left(\mathbf{W}_e \left[\mathbf{h}_{i-1}^{\tau}, \mathbf{h}^{\tau+1}\right]^{\top}\right),$$
where $\mathbf{W}_e$ denotes the variable matrix of the "evolve gate" in ESU.

ESU computes the output with the original inputs from the sibling and children nodes, i.e., $\mathbf{h}_{i-1}^{\tau}$ and $\mathbf{h}^{\tau+1}$, as well as the updated sibling-node state vector $\tilde{\mathbf{h}}_{i-1}^{\tau}$ and the evolved child-node state vector $\tilde{\mathbf{h}}^{\tau+1}$. ESU allows different combinations of the state vectors, which are controlled by two new selection gates $\mathbf{z}_i^{\tau}$ and $\mathbf{r}_i^{\tau}$ respectively. Formally, we can represent the final output of ESU as
$$\begin{aligned}
\mathbf{h}_i^{\tau} = \ & \mathbf{z}_i^{\tau} \otimes \mathbf{r}_i^{\tau} \otimes \tanh\left(\mathbf{W}_u [\tilde{\mathbf{h}}_{i-1}^{\tau}, \tilde{\mathbf{h}}^{\tau+1}]^{\top}\right) \\
\oplus \ & (\mathbf{1} \ominus \mathbf{z}_i^{\tau}) \otimes \mathbf{r}_i^{\tau} \otimes \tanh\left(\mathbf{W}_u [\mathbf{h}_{i-1}^{\tau}, \tilde{\mathbf{h}}^{\tau+1}]^{\top}\right) \\
\oplus \ & \mathbf{z}_i^{\tau} \otimes (\mathbf{1} \ominus \mathbf{r}_i^{\tau}) \otimes \tanh\left(\mathbf{W}_u [\tilde{\mathbf{h}}_{i-1}^{\tau}, \mathbf{h}^{\tau+1}]^{\top}\right) \\
\oplus \ & (\mathbf{1} \ominus \mathbf{z}_i^{\tau}) \otimes (\mathbf{1} \ominus \mathbf{r}_i^{\tau}) \otimes \tanh\left(\mathbf{W}_u [\mathbf{h}_{i-1}^{\tau}, \mathbf{h}^{\tau+1}]^{\top}\right),
\end{aligned}$$
where $\mathbf{z}_i^{\tau} = \sigma(\mathbf{W}_z [\mathbf{h}_{i-1}^{\tau}, \mathbf{h}^{\tau+1}]^{\top})$, $\mathbf{r}_i^{\tau} = \sigma(\mathbf{W}_r [\mathbf{h}_{i-1}^{\tau}, \mathbf{h}^{\tau+1}]^{\top})$, and $\mathbf{1}$ denotes a vector filled with value 1. Matrices $\mathbf{W}_u$, $\mathbf{W}_z$, $\mathbf{W}_r$ represent the variables involved in these components. The vector $\mathbf{h}_i^{\tau}$ will be the output to both the right sibling node and the father node in ESU.

3.2.2 Inherited Sequential Unit. The component on the right of Figure 3 is called the ISU, which accepts input from the left sibling node, i.e., $\mathbf{h}_{i-1}^{\tau}$, and from the higher-level father node, i.e., $\mathbf{h}_j^{\tau-1}$, and generates the output for the right sibling node and for the children nodes at the lower level. Similar to ESU, there also exists a "forget gate" in ISU for updating some information from the sibling state input. Slightly different from ESU, the "forget gate" in ISU is controlled by the states of the sibling and father nodes, which together with the updated input from the left-sibling node can be represented as follows:
$$\tilde{\mathbf{h}}_{i-1}^{\tau} = \mathbf{g}_i^{\tau} \otimes \mathbf{h}_{i-1}^{\tau}, \ \text{where} \ \mathbf{g}_i^{\tau} = \sigma\left(\mathbf{W}_g \left[\mathbf{h}_{i-1}^{\tau}, \mathbf{h}_j^{\tau-1}\right]^{\top}\right).$$
Here, $\mathbf{W}_g$ is the variable of the "forget gate" in ISU.

Another significant difference between ISU and ESU is that, for inheriting and updating the program context from the father node, e.g., the scopes of variables and other program information, ISU has an "inherit gate" for changing the input states of the father node. Formally, we can represent the "inherit gate" together with the updated input from the father node as
$$\tilde{\mathbf{h}}_j^{\tau-1} = \mathbf{t}_i^{\tau} \otimes \mathbf{h}_j^{\tau-1}, \ \text{where} \ \mathbf{t}_i^{\tau} = \sigma\left(\mathbf{W}_t \left[\mathbf{h}_{i-1}^{\tau}, \mathbf{h}_j^{\tau-1}\right]^{\top}\right),$$
where $\mathbf{W}_t$ is the variable of the "inherit gate" in ISU.

ISU computes the final output based on combinations of the original input vectors and the updated vectors, which are controlled by the gates $\mathbf{y}_i^{\tau}$ and $\mathbf{s}_i^{\tau}$ respectively. Formally, we can represent the final output of ISU as
$$\begin{aligned}
\mathbf{h}_i^{\tau} = \ & \mathbf{y}_i^{\tau} \otimes \mathbf{s}_i^{\tau} \otimes \tanh\left(\mathbf{W}_v [\tilde{\mathbf{h}}_{i-1}^{\tau}, \tilde{\mathbf{h}}_j^{\tau-1}]^{\top}\right) \\
\oplus \ & (\mathbf{1} \ominus \mathbf{y}_i^{\tau}) \otimes \mathbf{s}_i^{\tau} \otimes \tanh\left(\mathbf{W}_v [\mathbf{h}_{i-1}^{\tau}, \tilde{\mathbf{h}}_j^{\tau-1}]^{\top}\right) \\
\oplus \ & \mathbf{y}_i^{\tau} \otimes (\mathbf{1} \ominus \mathbf{s}_i^{\tau}) \otimes \tanh\left(\mathbf{W}_v [\tilde{\mathbf{h}}_{i-1}^{\tau}, \mathbf{h}_j^{\tau-1}]^{\top}\right) \\
\oplus \ & (\mathbf{1} \ominus \mathbf{y}_i^{\tau}) \otimes (\mathbf{1} \ominus \mathbf{s}_i^{\tau}) \otimes \tanh\left(\mathbf{W}_v [\mathbf{h}_{i-1}^{\tau}, \mathbf{h}_j^{\tau-1}]^{\top}\right),
\end{aligned}$$
where the gates $\mathbf{y}_i^{\tau} = \sigma(\mathbf{W}_y [\mathbf{h}_{i-1}^{\tau}, \mathbf{h}_j^{\tau-1}]^{\top})$, $\mathbf{s}_i^{\tau} = \sigma(\mathbf{W}_s [\mathbf{h}_{i-1}^{\tau}, \mathbf{h}_j^{\tau-1}]^{\top})$, and matrices $\mathbf{W}_y$, $\mathbf{W}_s$, $\mathbf{W}_v$ denote the variables of ISU. The vector $\mathbf{h}_i^{\tau}$ will be the output to both the right sibling node as well as all the children nodes.

In sum, the ESU and ISU components covered in the HSU unit model have a lot in common, as they (1) both have a forget gate, (2) both have an evolve/inherit gate, and (3) both combine the original states and updated states to generate the output. There also exist many differences between ESU and ISU. Besides the input/output among the sibling nodes, ISU also accepts input from the higher-level father node and generates output to the children nodes, while ESU accepts input from the children nodes instead and generates output to the father node. The evolve/inherit gates in ESU and ISU effectively adapt the program context changes between different levels, but in different directions. In the training and testing stages of EgoCoder, ESU and ISU will be mainly used as the unit structures for program interpretation and program generation respectively, to be introduced as follows.
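To make the gating equations above concrete, the following NumPy sketch implements a single ESU forward step under assumed dimensions (the ISU step is symmetric, with the father-node state taking the place of the children states); the weight shapes and random initialization are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def esu_forward(h_sib, h_chi, W_f, W_e, W_z, W_r, W_u):
    """One ESU step.  h_sib: left-sibling state (d,); h_chi: concatenated
    children states (n*d,); each W_* maps the concatenated input [h_sib, h_chi]
    to the dimension of the vector it gates or produces."""
    x = np.concatenate([h_sib, h_chi])        # [h_{i-1}^tau, h^{tau+1}]
    f = sigmoid(W_f @ x)                      # forget gate, shape (d,)
    e = sigmoid(W_e @ x)                      # evolve gate, shape (n*d,)
    h_sib_t, h_chi_t = f * h_sib, e * h_chi   # updated sibling / evolved children
    z, r = sigmoid(W_z @ x), sigmoid(W_r @ x) # selection gates, shape (d,)
    mix = lambda a, b: np.tanh(W_u @ np.concatenate([a, b]))  # tanh(W_u [a, b]^T)
    return (z * r * mix(h_sib_t, h_chi_t)
            + (1 - z) * r * mix(h_sib, h_chi_t)
            + z * (1 - r) * mix(h_sib_t, h_chi)
            + (1 - z) * (1 - r) * mix(h_sib, h_chi))

# Example with d = 8 hidden units and n = 3 children nodes.
d, n = 8, 3
rng = np.random.default_rng(0)
shapes = {"f": d, "e": n * d, "z": d, "r": d}
W = {k: rng.normal(size=(out, d + n * d)) for k, out in shapes.items()}
W_u = rng.normal(size=(d, d + n * d))
h_i = esu_forward(rng.normal(size=d), rng.normal(size=n * d),
                  W["f"], W["e"], W["z"], W["r"], W_u)
print(h_i.shape)   # (8,)
```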
3.3 Framework Learning
With the HSU unit introduced before, we can represent the architecture of EgoCoder as shown in Figure 4, which is also a tree structured diagram. Based on the ASTs parsed from the input program source code, a set of sub-trees will be sampled for training EgoCoder. The n children HSU nodes at the lower level are fed with their raw encoding features and sibling node states as the inputs. Here, n = dmax denotes the maximum node degree in the program AST, and dummy padding will be used for sub-trees with fewer than n children nodes. Among these n children nodes, the data flow is bi-directed, which can effectively model the sequential patterns in ASTs in both directions. Furthermore, the outputs of the children HSU nodes will all be fed to a father node at the higher level, which accepts no sibling node input. The output of the father HSU node will effectively recover its content. Besides this bottom-up mode, EgoCoder can also work well in a top-down mode, where the input of the father HSU node generates the contents of the children HSU nodes. In this part, we will introduce the EgoCoder model in great detail to illustrate how to train it with program ASTs.

Figure 4: The Architecture of EgoCoder.

3.3.1 Token Raw Encoding. As introduced before, in the program ASTs, the nodes denote the program components, which contain program syntax types, token contents and program intentions (optional). Based on the parsing results obtained from the program, we can obtain the syntax type set and the set of concrete keyword, variable, operator and other tokens used in the program, which will be represented as sets C and T respectively. Formally, for each node $v_i \in \mathcal{V}$ in the program ASTs, its representation can be denoted as a vector $\mathbf{x}_i = [\mathbf{x}_i^c, \mathbf{x}_i^t] \in \{0, 1\}^{|\mathcal{C}| + k \cdot |\mathcal{T}|}$, where $\mathbf{x}_i^c \in \{0, 1\}^{|\mathcal{C}|}$ and $\mathbf{x}_i^t \in \{0, 1\}^{k \cdot |\mathcal{T}|}$ represent the one-hot feature vectors of syntax types and tokens respectively, and k represents the maximal number of tokens contained in the tree nodes. For tree nodes with fewer than k tokens, dummy padding will be adopted.

3.3.2 Program Generation: Top-Down Training of EgoCoder. Based on the input raw feature vector $\mathbf{y}$ from the father node in EgoCoder as illustrated in Figure 4, we can denote the output result of the father node via the ISU model as
$$\mathbf{h}^{\tau} = \text{ISU}(\mathbf{y}, \text{null}; \mathbf{W}_I),$$
where null denotes a dummy padding vector and $\mathbf{W}_I$ covers all the variables involved in the ISU model introduced before.

By feeding $\mathbf{h}^{\tau}$ as the input to the children nodes at the lower level, EgoCoder generates the output representations of the children nodes. We can denote the state and output vectors of the i-th child node as
$$\begin{cases}
\mathbf{h}_i^{\tau+1} = \text{ISU}(\mathbf{h}^{\tau}, \mathbf{h}_{i-1}^{\tau+1}; \mathbf{W}_I), \\
\hat{\mathbf{x}}_i = \text{softmax}(\mathbf{W}_{down} \mathbf{h}_i^{\tau+1} + \mathbf{b}_{down}),
\end{cases}$$
where $\mathbf{h}_{i-1}^{\tau+1}$ denotes the input from the left sibling node, softmax(·) represents the softmax function, and $\mathbf{h}_0^{\tau+1} = \text{null}$ for the first child node without a left sibling. $\mathbf{W}_{down}$ and $\mathbf{b}_{down}$ are the variables involved in projecting the node state to the output.

Compared with the ground-truth representations of the children nodes in the sampled sub-tree, i.e., $\{\mathbf{x}_i\}_{i=1}^{d_{max}}$, the loss introduced by the ISU model on the sub-tree can be represented as
$$\mathcal{L}_{isu} = \sum_{i=1}^{d_{max}} \sum_{j=1}^{|\mathcal{C}| + k \cdot |\mathcal{T}|} -\mathbf{x}_i[j] \log \hat{\mathbf{x}}_i[j], \quad (1)$$
which is defined based on the cross-entropy loss function, where index j enumerates all the syntax types and tokens involved in the node representation vector.
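The short sketch below illustrates the node raw encoding of Section 3.3.1 and the cross-entropy loss of Eq. (1) for a single sub-tree; the toy vocabulary, the padding symbol and the helper names are illustrative assumptions rather than reported implementation details.

```python
import numpy as np

SYNTAX = ["module", "class", "function", "statement", "expression", "token"]
TOKENS = ["pivot_value", "=", "seq", "[]", "pivot_index", "<PAD>"]   # toy vocab T
K = 5                                                                # max tokens per node

def encode_node(syntax_type, tokens):
    """One-hot encode a node as x = [x^c, x^t] in {0,1}^(|C| + K*|T|)."""
    x_c = np.zeros(len(SYNTAX)); x_c[SYNTAX.index(syntax_type)] = 1.0
    x_t = np.zeros((K, len(TOKENS)))
    for slot, tok in enumerate((tokens + ["<PAD>"] * K)[:K]):        # dummy padding
        x_t[slot, TOKENS.index(tok)] = 1.0
    return np.concatenate([x_c, x_t.ravel()])

def isu_loss(x_true, x_pred, eps=1e-12):
    """Cross-entropy loss of Eq. (1), summed over the children of one sub-tree."""
    return float(-np.sum(x_true * np.log(x_pred + eps)))

# Children of the statement "pivot_value = seq[pivot_index]".
children = [encode_node("token", ["pivot_value"]),
            encode_node("token", ["="]),
            encode_node("expression", ["seq", "[]", "pivot_index"])]
x_true = np.stack(children)
x_pred = np.full_like(x_true, 1.0 / x_true.shape[1])   # stand-in softmax outputs
print(isu_loss(x_true, x_pred))
```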
3.3.3 Program Interpretation: Bottom-Up Training of EgoCoder. On the other hand, besides the top-down direction, we also train the EgoCoder model in a bottom-up manner. Based on the inputs for the children nodes, we can generate the content of the father node as well. Formally, we can represent the input vectors for the children nodes as $\{\mathbf{x}_i\}_{i=1}^{d_{max}}$. By feeding these vectors to the children nodes, we can represent the output vector of the i-th child node as
$$\mathbf{h}_i^{\tau+1} = \text{ESU}(\mathbf{x}_i, \mathbf{h}_{i-1}^{\tau+1}; \mathbf{W}_E),$$
where $\mathbf{h}_0^{\tau+1} = \text{null}$ for the first child node, and $\mathbf{W}_E$ represents the variables involved in the ESU model.

Furthermore, based on the children node representations, we can represent the state vector and output vector of the father node as
$$\begin{cases}
\mathbf{h}^{\tau} = \text{ESU}(\mathbf{h}^{\tau+1}, \text{null}; \mathbf{W}_E), \\
\hat{\mathbf{y}} = \text{softmax}(\mathbf{W}_{up} \mathbf{h}^{\tau} + \mathbf{b}_{up}),
\end{cases}$$
where the vector $\mathbf{h}^{\tau+1} = [\mathbf{h}_1^{\tau+1}, \mathbf{h}_2^{\tau+1}, \cdots, \mathbf{h}_n^{\tau+1}]^{\top}$ contains all the children node states, and $\mathbf{W}_{up}$ and $\mathbf{b}_{up}$ are the variables used to project the father node state to its output.

The loss introduced on the input sub-tree by comparing $\hat{\mathbf{y}}$ with the ground-truth vector $\mathbf{y}$ can be represented as
$$\mathcal{L}_{esu} = \sum_{j=1}^{|\mathcal{C}| + k \cdot |\mathcal{T}|} -\mathbf{y}[j] \log \hat{\mathbf{y}}[j]. \quad (2)$$

3.3.4 Program Completion: Sequential Training of EgoCoder. In the case where only a fragment of the program is provided for feeding the child nodes, the training process for the ESU will encounter great challenges, since the incomplete input will mislead the model into generating a wrong output. This happens very often, since missing any line of the program code will introduce an incomplete sub-tree structured diagram in the program AST. To resolve such a problem, we propose to generate the complete child node input information based on the program fragments by training the bi-directed HSU structure in EgoCoder.

As introduced before, based on the inputs of the children nodes $\{\mathbf{x}_i\}_{i=1}^{d_{max}}$, we can represent their state vectors as $\{\mathbf{h}_i^{\tau+1}\}_{i=1}^{d_{max}}$. In the bi-directed HSU, based on the state vector of the i-th child node, we can represent the inferred output vectors for the tokens on the left and on the right as vectors $\hat{\mathbf{x}}_{i,l}$ and $\hat{\mathbf{x}}_{i,r}$ respectively:
$$\begin{cases}
\hat{\mathbf{x}}_{i,l} = \text{softmax}(\mathbf{W}_{left} \mathbf{h}_i^{\tau+1} + \mathbf{b}_{left}), \\
\hat{\mathbf{x}}_{i,r} = \text{softmax}(\mathbf{W}_{right} \mathbf{h}_i^{\tau+1} + \mathbf{b}_{right}),
\end{cases}$$
where $\mathbf{W}_{left}$, $\mathbf{b}_{left}$, and $\mathbf{W}_{right}$, $\mathbf{b}_{right}$ are the variables used to project the states to the outputs in the left and right HSUs respectively. Compared with the ground truth, we can represent the loss introduced in generating the children node tokens as
$$\mathcal{L}_{hsu} = \sum_{i=2}^{d_{max}} \sum_{j=1}^{|\mathcal{C}| + k \cdot |\mathcal{T}|} -\mathbf{x}_i[j] \log \hat{\mathbf{x}}_{i-1,r}[j] \quad (3)$$
$$+ \sum_{i=1}^{d_{max}-1} \sum_{j=1}^{|\mathcal{C}| + k \cdot |\mathcal{T}|} -\mathbf{x}_i[j] \log \hat{\mathbf{x}}_{i+1,l}[j]. \quad (4)$$

3.3.5 Joint Optimization Objective Function. Based on the above descriptions, we can represent the joint optimization objective function of the EgoCoder model as
$$\min_{\mathbf{W}_I, \mathbf{W}_E, \mathbf{W}_P} \ \mathcal{L}_{isu} + \alpha \cdot \mathcal{L}_{esu} + \beta \cdot \mathcal{L}_{hsu}, \quad (5)$$
where $\mathbf{W}_P$ covers all the variables adopted to project the state vectors to the output space introduced above, and α, β denote the weights of the last two loss terms (in the experiments, α and β are both assigned the value 1.0).

Formally, to solve the above objective function, the learning process of EgoCoder is carried out based on the sub-tree structures (involving one parent node and all its child nodes) sampled from the program AST. To optimize the above loss function, we utilize Stochastic Gradient Descent (SGD) as the optimization algorithm. To be more specific, the training process involves multiple epochs. In each epoch, the training data is shuffled and a minibatch of instances is sampled to update the parameters with SGD. In addition, for each sampled sub-tree, we feed the EgoCoder model to minimize the loss terms $\mathcal{L}_{isu}$, $\mathcal{L}_{esu}$ and $\mathcal{L}_{hsu}$ iteratively for parameter learning. This process continues until convergence.

4 EXPERIMENTS
To test the effectiveness of the proposed unit model Hsu and the learning framework EgoCoder, we have conducted extensive experiments on a real-world program-comment dataset and compared EgoCoder with several existing text generation methods. In the following part of this section, we first introduce the experimental settings, including the dataset description, detailed experimental setup, comparison methods and evaluation metrics. After that, the experimental results and case studies are provided and analyzed.

4.1 Experimental Setting
4.1.1 Dataset Description. In the experiments, we take program code written in the Python programming language as an example. The program dataset used in the experiments covers the Python implementation code of basic algorithms, like different sort algorithms, search algorithms, hash algorithms and dynamic programming algorithms. In the program source code files, besides the source code, there also exists a sequence of comments indicating the functions of the program, which will be used as the program intention in the experiments. The dataset will be released as a benchmark for code generation soon.

4.1.2 Experimental Setup. In the experiments, instead of modeling the program code character by character, we propose to parse the code into ASTs, in which the smallest syntax type is the basic statement. From the ASTs, we can sample a set of sub-tree structured diagrams.
Table 1: Next Line Program Code Inference.
Evaluation Metrics
methods Accuracy Mi-Precision Mi-Recall Mi-F1 Ma-Precision Ma-Recall Ma-F1 W-Precision W-Recall W-F1 Train-Iteration
EgoCoder 0.949±0.149 0.949±0.149 0.949±0.149 0.949±0.149 0.932±0.186 0.93±0.188 0.93±0.189 0.954±0.143 0.949±0.149 0.95±0.148 43,000
Ast-BiRnn-LSTM 0.86±0.195 0.86±0.195 0.86±0.195 0.86±0.195 0.803±0.253 0.819±0.237 0.806±0.249 0.847±0.216 0.86±0.195 0.847±0.21 86,000
Ast-BiRnn-GRU 0.858±0.191 0.858±0.191 0.858±0.191 0.858±0.191 0.798±0.249 0.817±0.233 0.802±0.244 0.843±0.213 0.858±0.191 0.844±0.207 86,000
Ast-BiRnn-Basic 0.804±0.212 0.804±0.212 0.804±0.212 0.804±0.212 0.734±0.26 0.741±0.254 0.732±0.259 0.807±0.222 0.804±0.212 0.798±0.22 86,000
Ast-AutoEncoder 0.326±0.175 0.326±0.175 0.326±0.175 0.326±0.175 0.175±0.14 0.222±0.149 0.186±0.139 0.276±0.187 0.326±0.175 0.284±0.174 4,300,000
BiRnn-LSTM 0.774±0.07 0.774±0.07 0.774±0.07 0.774±0.07 0.753±0.126 0.76±0.115 0.756±0.122 0.767±0.087 0.774±0.07 0.769±0.081 430,000
BiRnn-GRU 0.774±0.07 0.774±0.07 0.774±0.07 0.774±0.07 0.753±0.126 0.76±0.115 0.756±0.122 0.767±0.087 0.774±0.07 0.769±0.081 430,000
BiRnn-Basic 0.749±0.08 0.749±0.08 0.749±0.08 0.749±0.08 0.707±0.148 0.71±0.148 0.708±0.148 0.746±0.082 0.749±0.08 0.747±0.081 430,000

Table 2: Program Generation from Comments.
methods Accuracy
EgoCoder 35/43
Ast-BiRnn-LSTM 13/43
Ast-BiRnn-GRU 13/43
Ast-BiRnn-Basic 11/43
Ast-AutoEncoder 0/43
BiRnn-LSTM 2/43
BiRnn-GRU 1/43
BiRnn-Basic 2/43

Table 3: Program Completion from a Random Input Line.
methods Accuracy
EgoCoder 34/43
Ast-BiRnn-LSTM 12/43
Ast-BiRnn-GRU 12/43
Ast-BiRnn-Basic 10/43
Ast-AutoEncoder 0/43
BiRnn-LSTM 0/43
BiRnn-GRU 0/43
BiRnn-Basic 0/43
The contents attached to the sub-tree nodes, together with the sub-tree structure, will be fed to EgoCoder for learning. Based on the program ASTs, we can denote the maximum children node number as dmax, and the maximum number of tokens attached to each node as k. For the nodes with fewer than dmax children or k tokens, a dummy one-hot feature vector representation is used for padding. At the same time, for the AST root node, we extract the textual comments of the whole program as its content, which is represented as a sequence of natural language tokens and can be modeled with the Hsu model effectively as well. Based on the sampled sub-trees, we propose to train EgoCoder iteratively, as introduced at the end of Section 3, to learn the variables.

4.1.3 Comparison Methods. In the experiments, we compare EgoCoder with various existing prediction and generative models, which are listed as follows:

• EgoCoder: The EgoCoder model proposed in this paper is based on the new Hsu unit model, which can learn the information in program code based on its ASTs.
• Ast-BiRnn-LSTM: The Ast-BiRnn-LSTM model is an AST-based bi-directional RNN model [28] using LSTM [16] as the unit cell. Ast-BiRnn-LSTM can capture the sequential patterns in program code textual data in both directions simultaneously, and it is able to infer the tokens ahead of and after the input.
• Ast-BiRnn-GRU: The Ast-BiRnn-GRU model is also an AST-based bi-directional RNN model [28], using GRU [9] as the unit cell. Ast-BiRnn-GRU can capture the sequential patterns in program code textual data in both directions as well.
• Ast-BiRnn-Basic: In the experiments, we also compare with the bi-directional RNN model Ast-BiRnn-Basic [28], which uses the basic neuron cell as the unit cell.
• Ast-AutoEncoder: With two hidden layers, we use the deep Autoencoder model [33] as another baseline method in the experiments. Ast-AutoEncoder can effectively capture the patterns of the sequential program components extracted from the ASTs.
• BiRnn-LSTM: To show the advantages of modeling the program code textual information based on the AST, we also compare with the traditional bi-directional RNN [28] models based on the raw program textual information. Model BiRnn-LSTM splits the program code and comments into tokens based on the spaces among them.
• BiRnn-GRU: Model BiRnn-GRU, using GRU [9] as the unit model, also splits the program code and comments into tokens based on the spaces among them and infers the following tokens iteratively.
• BiRnn-Basic: Model BiRnn-Basic has the same architecture as BiRnn-LSTM and BiRnn-GRU, but it uses the basic neuron as the unit model in the learning process.

4.1.4 Evaluation Metrics. We formulate the program code generation (from comments to program code), program code interpretation (from program code to program comments), and program completion problems as multi-class classification problems, where the inferred tokens are used as the labels. In the experiments, we use the traditional classification evaluation metrics Accuracy, Precision, Recall and F1 for measuring the performance of the models. Here, the Precision, Recall and F1 metrics cover the micro, macro and weighted versions respectively. In addition, we also report the number of iterations required to train the models as another evaluation metric. Here, we need to add a remark that, as indicated in the scikit-learn documentation2, the weighted-F1 metric, which considers the label imbalance, may take a value that does not lie between the corresponding weighted precision and recall in the experimental results.

2 http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
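For reproducibility, the micro/macro/weighted scores described above can be computed with scikit-learn's precision_recall_fscore_support (the function referenced in the footnote); the toy label arrays below are purely illustrative.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground-truth and predicted token labels for one generated code line.
y_true = ["pivot_value", "=", "seq", "[]", "pivot_index"]
y_pred = ["pivot_value", "=", "seq", "[]", "store_index"]

print("Accuracy:", accuracy_score(y_true, y_pred))
for average in ("micro", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0)
    print(average, round(p, 3), round(r, 3), round(f1, 3))
```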
Table 4: Program Interpretation with Input Source Code.
Evaluation Metrics
methods Accuracy Mi-Precision Mi-Recall Mi-F1 Ma-Precision Ma-Recall Ma-F1 W-Precision W-Recall W-F1 Train-Iteration
EgoCoder 0.978±0.072 0.978±0.072 0.978±0.072 0.978±0.072 0.963±0.107 0.963±0.105 0.963±0.107 0.978±0.072 0.978±0.072 0.977±0.073 43,000
Ast-BiRnn-LSTM 0.84±0.198 0.84±0.198 0.84±0.198 0.84±0.198 0.779±0.248 0.793±0.241 0.78±0.248 0.832±0.21 0.84±0.198 0.829±0.208 86,000
Ast-BiRnn-GRU 0.849±0.188 0.849±0.188 0.849±0.188 0.849±0.188 0.787±0.241 0.802±0.232 0.789±0.24 0.84±0.202 0.849±0.188 0.837±0.198 86,000
Ast-BiRnn-Basic 0.798±0.212 0.798±0.212 0.798±0.212 0.798±0.212 0.728±0.254 0.731±0.254 0.724±0.257 0.805±0.213 0.798±0.212 0.793±0.215 86,000
Ast-AutoEncoder 0.321±0.182 0.321±0.182 0.321±0.182 0.321±0.182 0.174±0.144 0.214±0.151 0.183±0.143 0.279±0.192 0.321±0.182 0.285±0.182 4,300,000
BiRnn-LSTM 0.782±0.069 0.782±0.069 0.782±0.069 0.782±0.069 0.769±0.121 0.769±0.121 0.769±0.121 0.782±0.069 0.782±0.069 0.782±0.069 430,000
BiRnn-GRU 0.782±0.069 0.782±0.069 0.782±0.069 0.782±0.069 0.769±0.121 0.769±0.121 0.769±0.121 0.782±0.069 0.782±0.069 0.782±0.069 430,000
BiRnn-Basic 0.751±0.135 0.751±0.135 0.751±0.135 0.751±0.135 0.731±0.183 0.731±0.183 0.731±0.183 0.751±0.135 0.751±0.135 0.751±0.135 430,000

def binary_search(seq, key):


from random import randrange

Original from random import randrange


Original Generated
lo = 0

hi = len(seq) - 1
Code Code Code
def partition(seq, left, right, pivot_index):
def partition ( seq , left , right , pivot_index ) :

while hi >= lo:

mid = lo + (hi - lo) // 2


pivot_value = seq[pivot_index]
pivot_value = seq [ pivot_index ]

if seq[mid] < key:


seq[pivot_index], seq[right] = seq[right], seq[pivot_index]
( seq [ pivot_index ] , seq [ right ] ) = ( seq [ right ] , seq [ pivot_index ] )

lo = mid + 1
store_index = left
store_index = left

elif seq[mid] > key:


for i in range(left, right):
for i in range ( left , right ) :

hi = mid - 1
if seq[i] < pivot_value:
if ( seq [ i ] < pivot_value ) :

else:
seq[i], seq[store_index] = seq[store_index], seq[i]
( seq [ i ] , seq [ store_index ] ) = ( seq [ store_index ] , seq [ i ] )

return mid
store_index += 1
store_index += 1

return False seq[store_index], seq[right] = seq[right], seq[store_index]

.
( seq [ store_index ] , seq [ right ] ) = ( seq [ right ] , seq [ store_index ] )

n. ft
return store_index
return store_index

def binary_search ( seq , key ) :Generated



lo = 0
Code

io ra
hi = ( len ( seq ) - 1 )
def sort(seq, left, right):
def sort ( seq , left , right ) :

while ( hi >= lo ) :
if len(seq) <= 1:
if ( len ( seq ) <= 1 ) :

mid = ( lo + ( ( hi - lo ) // 2 ) )
return seq
return seq

ut d
if ( seq [ mid ] < key ) :
elif left < right:
elif ( left < right ) :

lo = ( mid + 1 )
pivot = randrange(left, right)
pivot = randrange ( left , right )

elif ( seq [ mid ] > key ) :


pivot_new_index = partition(seq, left, right, pivot)
pivot_new_index = partition ( seq , left , right , pivot )

ib ing
hi = ( mid - 1 )
sort(seq, left, pivot_new_index - 1)

else :

sort ( seq , left , ( pivot_new_index - 1 ) )

sort(seq, pivot_new_index + 1, right)


sort ( seq , ( pivot_new_index + 1 ) , right )

return mid
return seq
return False return seq
(a) Binary Search Code. (b) Quick Sort Code.
str rk
Figure 5: Succeeded Examples of EgoCoder. Left: Binary Search; Right: Quick Sort.
di o
Accuracy achieved by Ast-AutoEncoder and also surpasses Ast- Among the baseline methods, EgoCoder can still achieve much
or d w

BiRnn-LSTM, Ast-BiRnn-GRU and Ast-BiRnn-Basic by more better results than the baseline methods with great advantages.
than 10% and outperforms BiRnn-LSTM, BiRnn-GRU and BiRnn- Among all the program code input, EgoCoder can correctly gen-
Basic by more than 22.6%. Similar results can also be observed for erate about 0.978 the program comment tokens. The AST-BiRNN
the other evaluation metrics. In addition, model EgoCoder takes methods can also achieve a very good performance, and they can
t f he

far less iterations before convergence based on the training set. As obtain an average Accuracy around 0.8, which is better than the
shown in Table 1, the iteration required for EgoCoder to converge traditional BiRNN methods. Method Ast-AutoEncoder performs
No lis

1 of the required iterations


is merely about 43, 000, which is about 100 the worst in the program interpretation task, which can merely
1
by Ast-AutoEncoder, 2 of the required iterations by Ast-BiRnn- achieve an average Accuracy score around 0.321.
b

LSTM, Ast-BiRnn-GRU, Ast-BiRnn-Basic, and about 10 1 of the


4.2.3 Program Completion with Code Fragments. Table 3 covers
pu

required iterations by BiRnn-LSTM, BiRnn-GRU and BiRnn-Basic.


the program code completion experimental results of the compari-
In Table 2, we show the results obtained by EgoCoder in gen-
son methods. Here, for each program in the dataset, we randomly
erating the complete program code based on the input program
Un

pick one line in the program as the input for the models to complete
comments. Here, the generation process involves the iterative in-
the program code (i.e., generate the code ahead of or after the input
ferences of the program tokens at the next lines based on the input
line). Slightly different from the program generation as shown in
program comments without any interactions with the outside world.
Table 2, where the input is the tree root node content, the input
Among the 43 input program comments, EgoCoder is able to gener-
in the program completion can be any lines in the program code,
ate 35 of them without making any mistakes, which outperform the
results obtained in which are slightly lower than those in Table 2.
baseline methods with great advantages. For the AST-BiRNN meth-
Among these 43 programs in the dataset, EgoCoder completes
ods, they can generate 11-13 of the program without any mistakes.
34 of the correctly, which is much better than the other baseline
For the traditional BiRNN methods, they can only generate 1-2
methods. The AST-BiRNN methods can still complete about 10
programs, while Ast-AutoEncoder cannot generate any program
of the programs correctly, while the remaining methods cannot
at all.
complete the program code at all. The program completion task
may require the model to be able to generate contents in both the
4.2.2 Program Interpretation based on Code. In Table 4, we pro-
sequential directions and the hierarchical directions, i.e., ahead of
vide the experimental results of the comparison methods in gen-
the input, after the input, above the input and below the input,
erating the program comments based on the program code input.
which can demonstrate the advantages of the Hsu model compared
Compared with the program code, the program comments are of a
against the traditional RNN models.
shorter length and have less tokens to be predicted, and the evalu-
ation scores achieved by the baseline methods in Table 4 are also 4.2.4 Experimental Discoveries. Generally, according to the pro-
slightly larger than the scores in Table 1. gram generation, program interpretation and program completion
8
from math import sqrt
class UnionFindWithPathCompression:
# continued, the second part

def atkin(limit):
elif self.__rank[x_root] > self.__rank[y_root]:

def __init__(self, N):

if limit == 2:
self.__parent[y_root] = x_root

if type(N) != int:

return [2]
5 else:

if limit == 3:

raise TypeError("size_must_be_integer")

if N < 0:
self.__parent[y_root] = x_root

return [2, 3]

raise ValueError("N_cannot_be_a_negative_integer")
self.__rank[x_root] = self.__rank[x_root] + 1

if limit == 5:

self.__parent = []

return [2, 3, 5]

self.__rank = []
def __find(self, x):

if limit < 2:

return []
self.__N = N
if self.__parent[x] != x:

primes = [2, 3, 5]
for i in range(0, N):
self.__parent[x] = self.__find(self.__parent[x])

is_prime = [False] * (limit + 1)


self.__parent.append(i)
1 return self.__parent[x]

sqrt_limit = int(sqrt(limit)) + 1
self.__rank.append(0)

def find(self, x):

for x in range(1, sqrt_limit):


def make_set(self, x):
2 self.__validate_ele(x)

for y in range(1, sqrt_limit):


if type(x) != int:
if self.__parent[x] == x:

raise TypeError("x_must_be_integer")
return x

6
n = 4 * x ** 2 + y ** 2

if n <= limit and (n % 12 == 1 or n % 12 == 5):


if x != self.__N:
3 else:

is_prime[n] = not is_prime[n]


raise ValueError("index_{0}".format(self.__N))
return self.find(self.__parent[x])

n = 3 * x ** 2 + y ** 2
self.__parent.append(x)

if n <= limit and (n % 12 == 7):


self.__rank.append(0)
def is_connected(self, x, y):

1 is_prime[n] = not is_prime[n]


self.__N = self.__N + 1
self.__validate_ele(x)

n = 3 * x ** 2 - y ** 2
self.__validate_ele(y)

if x > y and (n <= limit) and (n % 12 == 11):


def union(self, x, y):
return self.find(x) == self.find(y)

is_prime[n] = not is_prime[n]


self.__validate_ele(x)
4
self.__validate_ele(y)
def parent(self, x):

for index in range(5, sqrt_limit):


x_root = self.__find(x)
return self.__parent[x]

.
if is_prime[index]:

n. ft
y_root = self.__find(y)

for composite in range(index ** 2, limit, index ** 2):

2 if x_root == y_root:
def __validate_ele(self, x):

is_prime[composite] = False

return
if type(x) != int:

io ra
for index in range(7, limit):

if self.__rank[x_root] < self.__rank[y_root]:


raise TypeError("{0}_is_not_an_integer".format(x))

if is_prime[index]:

self.__parent[x_root] = y_root
if x < 0 or x >= self.__N:

primes.append(index)

raise ValueError("{0}_is_not_in_[0,{1})".format(x, self.__N))

ut d
return primes
# the first part
(a) Sieve of Atkin Code. (b) Union Find Class Code.

ib ing
Figure 6: Examples that EgoCoder Fail to Handle. Left: Sieve of Atkin Algorithm; Right: Union Find Class Class. (The dupli-
cated contents are highlighted in colored bolded font).
str rk
experimental results, EgoCoder performs very well in inferring 4.3.2 Failed Cases. Besides the succeeded cases aforementioned,
both the contents at both the children nodes and father node. The in Figure 6, we also provide two cases that EgoCoder cannot han-
di o
main reason can be due to that the Hsu model effectively cap- dle well, especially in program code generation and completion.
or d w

tures both the sequential and hierarchical information patterns in The left program code is about the Sieve of Atkin algorithm, and
the ASTs, which performs much more effectively than the models the right code is about the Union Find class with compressed path.
merely capturing the sequential patterns. In addition, the AST will The main problem with the code is that it contains so many du-
also greatly improve the model performance, since the tree diagram plicated contents. In the blocks, we can identify several common
t f he

will effectively help the models outline the program hierarchical statements (in the same colors), which will make the EgoCoder
structure. The Ast-AutoEncoder model cannot achieve a good fail to work. For instance, in the Union Find class code, given a
No lis

performance in these content generation tasks. statement “self.__validate_ele(y)” (in bolded blue font), it will be
very hard for the model to generate the statement after it, since
b

4.3 Case Study there are two different options “x_root = self.__find(x)” and “re-
turn self.find(x) == self.find(y)”. Such a problem can be hopefully
pu

In this part, we will provide a study about the succeeded and failed
addressed by incorporate a even deeper architecture in EgoCoder.
cases of EgoCoder in the experiments.
Just like this failed case, if we can effectively incorporate the father
Un

node of “self.__validate_ele(y)” (i.e., “def union(self, x, y):” and “def


4.3.1 Succeeded Cases. In Figures 5, we show the program code
is_connected(self, x, y):”), the conflict can be resolved promisingly.
that EgoCoder can handle very well in both generation, interpre-
We will leave it as a potential future work.
tation and completion. The left program code is about the Binary
Search algorithm and the right code is about the Quick Sort algo-
5 RELATED WORK
The problem studied in this paper is closely related to research problems about deep neural networks, text generation and program synthesis.
Deep Neural Networks: The essence of deep learning is to compute hierarchical features or representations of the observational data [12, 22]. With the surge of deep learning research and applications in recent years, many works have applied deep learning methods, like the deep belief network [15], deep Boltzmann machine [27], deep neural network [17, 20] and deep autoencoder model [33], to various applications, like speech and audio processing [11, 14], language modeling and processing [6, 25], information retrieval [13, 27], object recognition and computer vision [22], as well as multimodal and multi-task learning [35, 36].
Text Generation: Text generation has been an important problem in both text mining and natural language processing. Depending on the input information, the text generation problem can be categorized into text generation from keywords [32], concepts [19], topics [8], ontologies [7] and images [34]. In [32], the authors propose a method consisting of candidate-text construction and evaluation for sentence generation from keywords/headwords. Konstas et al. [19] introduce a global model for concept-to-text generation, which refers to the task of automatically producing textual output from non-linguistic input. In terms of the objective output, the text generation problem includes question generation [8], image captions [34] and image descriptions [21]. Various models have been used for text generation problems, including RNNs [31], autoencoders [23] and GANs [37].
Program Synthesis: The problem studied in this paper is also closely related to the program synthesis problem studied in software engineering. Formally, the goal of software program synthesis is to generate programs automatically from high-level specifications, and many research works have been done on this topic already. Program synthesis is a challenging problem, which may require external supervision from templates [30], examples and type information [26], or oracles [18]. In [18], the authors present a novel approach to the automatic synthesis of loop-free programs based on a combination of oracle-guided learning from examples. Based on program templates, [30] introduces an approach to generate programs from the templates. Osera et al. [26] introduce an algorithm for synthesizing recursive functions that process algebraic datatypes, which exploits both type information and input-output examples to prune the search space. Some other program synthesis works address the problem with recursive algorithms [5], a deductive approach [24], crowd-sourcing [10], and program verification [29].

6 CONCLUSION
In this paper, we have studied a novel research problem about program generation, which covers the tasks of program code generation, program interpretation and program completion. To address such a challenging problem, a new learning model, namely EgoCoder, has been introduced. EgoCoder learns the program code contents by parsing the program code into ASTs, whose nodes contain the program code/comment contents while the AST structure indicates the program logic flows. To effectively capture both the hierarchical and sequential patterns in the ASTs, a new unit model, i.e., Hsu, is introduced as the basic building block of EgoCoder. We have tested the effectiveness of EgoCoder on real-world program datasets, and EgoCoder achieves significantly better performance than the other state-of-the-art baseline methods in addressing the program generation tasks.

REFERENCES
[1] Codebases: Millions of lines of code. http://www.informationisbeautiful.net/visualizations/million-lines-of-code/. [Online; accessed 2-December-2017].
[2] Engineering software market to reach $50.34 bn in 2022. https://www.automation.com/automation-news/industry/engineering-software-market-to-reach-5034-bn-in-2022. [Online; accessed 2-December-2017].
[3] The labor market supply & demand of software developers. http://www.economicmodeling.com/2017/06/01/labor-market-supply-demand-software-developers/. [Online; accessed 2-December-2017].
[4] Why Google stores billions of lines of code in a single repository. https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext. [Online; accessed 2-December-2017].
[5] A. Albarghouthi, S. Gulwani, and Z. Kincaid. Recursive program synthesis. In CAV, 2013.
[6] E. Arisoy, T. Sainath, B. Kingsbury, and B. Ramabhadran. Deep neural network language models. In WLM, 2012.
[7] K. Bontcheva. Generating tailored textual summaries from ontologies, 2005.
[8] Y. Chali and S. Hasan. Towards topic-to-question generation. Comput. Linguist., 2015.
[9] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. 2014.
[10] R. Cochran, L. D'Antoni, B. Livshits, D. Molnar, and M. Veanes. Program boosting: Program synthesis via crowd-sourcing. In POPL, 2015.
[11] L. Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In ICASSP, 2013.
[12] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[13] S. Hill. Elite and upper-class families. In Families: A Social Class Perspective. 2012.
[14] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.
[15] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 2006.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 1997.
[17] H. Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. Technical report, Fraunhofer Institute for Autonomous Intelligent Systems (AIS), 2002.
[18] S. Jha, S. Gulwani, S. Seshia, and A. Tiwari. Oracle-guided component-based program synthesis. In ICSE, 2010.
[19] I. Konstas and M. Lapata. A global model for concept-to-text generation. J. Artif. Int. Res., 2013.
[20] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[21] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521, 2015. http://dx.doi.org/10.1038/nature14539.
[23] J. Li, M. Luong, and D. Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. In ACL, 2015.
[24] Z. Manna and R. Waldinger. A deductive approach to program synthesis. ACM Trans. Program. Lang. Syst., 1980.
[25] A. Mnih and G. Hinton. A scalable hierarchical distributed language model. In NIPS, 2009.
[26] P. Osera and S. Zdancewic. Type-and-example-directed program synthesis. In PLDI, 2015.
[27] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 2009.
[28] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. Trans. Sig. Proc., 1997.
[29] S. Srivastava, S. Gulwani, and J. Foster. From program verification to program synthesis. In POPL, 2010.
[30] S. Srivastava, S. Gulwani, and J. Foster. Template-based program verification and program synthesis. International Journal on Software Tools for Technology Transfer, 2013.
[31] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[32] K. Uchimoto, H. Isahara, and S. Sekine. Text generation from keywords. In COLING, 2002.
[33] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 2010.
[34] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator, 2014.
[35] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: Learning to rank with joint word-image embeddings. Journal of Machine Learning, 2010.
[36] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
[37] Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. Henao, D. Shen, and L. Carin. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850, 2017.
