Figure 1: Example of code-AST-summary triples (the snippet "return 0 if x < 0, else return x", its AST with Return, IfExp, Compare, NameLoad(x), Lt, constant(0), body and orelse nodes, and the code summary). We mainly need to understand the ancestor-descendent and sibling relationships in the AST to generate a summary.

neural networks have started to attract more and more attention [20, 37–39, 56]. Current state-of-the-art approaches all follow the Transformer-based encoder-decoder architecture [5, 8, 45, 48, 49] and can be trained end-to-end with code-summary pairs. Since source code is highly structured and follows strict programming language grammars, a common practice is to also leverage the Abstract Syntax Tree (AST) to help the encoder digest the structured information. The AST is usually linearized by different algorithms such as pre-order traversal [21], structure-based traversal (SBT) [18] and path decomposition [4], and then fed into the encoder. Several works also proposed architectures specific to tree encoding, like the tree-LSTM [11, 51].

However, the linearized ASTs, as they contain additional structured information, are much longer than their corresponding source code sequences. Some linearization algorithms further increase the length; for example, linearizing with SBT roughly doubles the size. This makes it extremely difficult for the model to accurately detect useful dependency relations in the overlong input sequence². Moreover, it brings significant computational overhead, especially for state-of-the-art Transformer-based models, where the number of self-attention operations grows quadratically with the sequence length. Encoding ASTs with tree-based models like the tree-LSTM incurs extra complexity because the whole tree must be traversed to obtain the state of each node.

² Indeed, encoding the overlong AST with SBT even underperforms directly encoding the source code when using a Transformer with relative position embeddings [1].

In this work, we assume that the state of a node in the AST is affected most by its (1) ancestor-descendent nodes, which represent the hierarchical relationship across different blocks, and (2) sibling nodes, which represent the temporal relationship within one block. We show an example of code summarization in Figure 1. As can be seen, we need the ancestor-descendent relationship to understand the high-level procedure, and the sibling relationship to understand the low-level details within a block. Capturing these two relationships is enough for producing the summary; modelling the full attention among all nodes is unnecessary.

Based on this intuition, we propose AST-Trans, a simple variant of the Transformer model that efficiently handles the tree-structured AST. AST-Trans exploits ancestor-descendant and sibling relationship matrices to represent the tree structure, and uses these matrices to dynamically exclude irrelevant nodes in different self-attention layers. The absolute position embedding from the original Transformer is replaced with relative position embeddings defined by the two relationship matrices to better model the dependency itself. We further describe several implementations of the proposed AST-Trans and provide a comprehensive analysis of their computational complexity. In short, the contributions of this paper are as follows:

• We propose AST-Trans, which can efficiently encode long AST sequences with linear complexity, in contrast with the quadratic complexity of the standard Transformer.
• We perform a comprehensive analysis, with both theoretical and empirical evidence, of the computational complexity of different implementations.
• We validate our proposed model on two datasets of Java and Python. Experimental results show that AST-Trans outperforms the state of the art by a substantial margin.
• We compare representative methods for AST encoding and discuss their pros and cons.

Paper Organization. The remainder of this paper is organized as follows. Section 2 presents background knowledge on the Transformer and the AST. Section 3 elaborates on the details of AST-Trans, Section 4 presents its different implementations, and Section 5 analyzes their complexity. Section 6 explains the experimental setup and analyzes the results. Section 7 discusses threats to validity. Section 8 surveys the related work. Finally, Section 9 concludes the paper and points out future research directions.

2 BACKGROUND

Transformer. The Transformer architecture was initially proposed for neural machine translation [49]. It consists of stacked multi-head encoder and decoder layers. In each encoder stack, the inputs first flow through a self-attention sublayer and are then fed into a position-wise feed-forward network followed by layer normalization. The decoder has a set of cross-attention layers to help it focus on relevant parts of the input sequence. The Transformer removes the recurrence mechanism in favor of self-attention. As each word in a sentence flows through the encoder and decoder stacks simultaneously, the model itself has no sense of word order; therefore, a position embedding is added to each word embedding to supply the order information.

Abstract Syntax Tree (AST). An Abstract Syntax Tree (AST) uniquely represents a source code snippet in a given language and grammar [4]. The leaves of the tree are terminals, usually referring to variables, types and method names. The non-leaf nodes are non-terminals and represent a restricted set of structures in the programming language, e.g., loops, expressions, and variable declarations. For example, in Figure 1, variables (such as NameLoad(x)) are represented as terminals of the AST, while syntactic structures (such as Compare) are represented as non-terminals. Since variable and method names can be rather freely defined, directly processing the source code can be challenging. Its corresponding AST, due to its strict structure, often serves as a substitute when encoding the source code.
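As a concrete illustration (our own sketch, not part of the paper's pipeline), Python's built-in ast module already exposes this structure for the snippet in Figure 1; note that its node labels (Name, Constant) differ slightly from the merged labels such as NameLoad(x) used in the figure:

import ast

# The snippet summarized in Figure 1: "return 0 if x < 0, else return x".
tree = ast.parse("def f(x):\n    return 0 if x < 0 else x")

def show(node, depth=0):
    # Print each node type indented by depth to expose the
    # Return -> IfExp -> Compare hierarchy shown in Figure 1.
    print("  " * depth + type(node).__name__)
    for child in ast.iter_child_nodes(node):
        show(child, depth + 1)

show(tree)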
3 AST-TRANS

This section details our proposed AST-Trans. An AST is first linearized into a sequence. Then the ancestor-descendent and sibling relationships among its nodes are denoted through two specific matrices. Based on these matrices, we replace the standard self-attention with tree-structured attention to better model the two relationships; irrelevant nodes are dynamically ruled out to reduce the computational cost. We will first introduce different linearization methods (Section 3.1), then explain the construction of the two relationship matrices (Section 3.2), and finally present the tree-structured attention that utilizes the matrices (Section 3.3).
Table 1: Linearized AST of the tree in Figure 1 with POT, SBT and PD.

Methods | Linearized AST sequence
POT | Return IfExp Compare NameLoad(x) Lt constant(0) body constant(0) orelse NameLoad(x)
SBT | ( Return ( IfExp ( Compare ( constant(0) ) constant(0) ( Lt ) Lt ( NameLoad(x) ) NameLoad(x) ) Compare ( body ( constant(0) ) constant(0) ) body ( orelse ( NameLoad(x) ) NameLoad(x) ) orelse ) IfExp ) Return
PD | Path1: Lt Compare constant(0); Path2: NameLoad(x) Compare constant(0); Path3: constant(0) Compare IfExp body constant(0); ...

Figure 2: Example of generating position matrices for the ancestor-descendent (A) and sibling (S) relationships. The position matrix generated from the linear relationship is used in standard Transformers.
3.1 AST Linearization

In order to encode the tree-shaped AST, it first needs to be converted into a sequence with a linearization method. The three most representative linearization methods used in current works are:

(1) Pre-order Traversal (POT): It visits the tree nodes in pre-order. Sequences obtained by pre-order traversal are lossy, since the original ASTs cannot be unambiguously reconstructed from them.
(2) Structure-based Traversal (SBT): It adds additional brackets [18] to indicate the parental-descendent relationship, such that each sequence can be unambiguously mapped back to the AST, but it also doubles the size of the linearized sequence.
(3) Path Decomposition (PD): It represents the AST by concatenating the paths between pairs of random leaf nodes. The total number of paths can be too large to compute, and therefore random sampling is needed [4].

Table 1 shows the AST in Figure 1 linearized with the above three methods. For POT and SBT, the linearized trees can be directly fed into the encoder. For PD, the average total number of paths can be over 200, so concatenating them all for training is infeasible [4]. In practice, mean pooling is run over the states of each path such that each path has one unique representation. The decoder only attends to these unique path representations instead of specific nodes within paths. This can hurt the model when copying user-defined names (stored in leaf nodes) is needed.

We adopt the simplest POT linearization for our model. We show that it already achieves state-of-the-art results and that more complex linearization methods like SBT do not help. PD does not apply to our model since it treats one path as a whole; we will show in Section 6.3 that this leads to poor performance in code summarization.
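To make the first two schemes concrete, the following sketch (our own illustration; the SBT bracket convention follows [18] only loosely) linearizes a small tree given as (label, children) tuples:

# A tiny AST given as (label, [children]) tuples; the shape mirrors Figure 1.
tree = ("Return", [("IfExp", [("Compare", [("NameLoad(x)", []), ("Lt", []), ("constant(0)", [])]),
                              ("body", [("constant(0)", [])]),
                              ("orelse", [("NameLoad(x)", [])])])])

def pot(node):
    """Pre-order traversal: parent label first, then children left to right."""
    label, children = node
    seq = [label]
    for child in children:
        seq += pot(child)
    return seq

def sbt(node):
    """Structure-based traversal: wrap every subtree in '(' ... ') label'."""
    label, children = node
    seq = ["(", label]
    for child in children:
        seq += sbt(child)
    seq += [")", label]
    return seq

print(" ".join(pot(tree)))   # 10 tokens, one per node
print(" ".join(sbt(tree)))   # each node label now appears twice, plus brackets

Running it on the tree of Figure 1 reproduces the POT row of Table 1 and shows how SBT repeats every node label.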
3.2 Relationship Matrices

We define the two kinds of relationships between tree nodes that we care about: the ancestor-descendant (A) and sibling (S) relationships. The former represents the hierarchical information across blocks, and the latter represents the temporal information within one block. Specifically, two nodes have the ancestor-descendant relationship if there exists a directed path from the root node that traverses through both of them. Two nodes have the sibling relationship if they share the same parent node.

We use two position matrices A (N x N) and S (N x N) to represent the ancestor-descendent and sibling relationships respectively, where N is the total number of nodes in the AST. We denote the i-th node in the linearized AST as n_i. A_ij is the distance of the shortest path between n_i and n_j in the AST. S_ij is the horizontal sibling distance between n_i and n_j if they satisfy the sibling relationship. If a relationship is not satisfied, the corresponding value in the matrix is infinity. Note that we consider the relative relationship between two nodes, which means A_ij = -A_ji and S_ij = -S_ji if a relationship exists between n_i and n_j.

Formally, we use SPD(i, j) and SID(i, j) to denote the Shortest Path Distance and the horizontal SIbling Distance between n_i and n_j in the AST. The values in the relationship matrices are defined as:

$$
A_{ij} = \begin{cases} \mathrm{SPD}(i,j) & \text{if } |\mathrm{SPD}(i,j)| \le P \\ \infty & \text{otherwise} \end{cases}
\qquad
S_{ij} = \begin{cases} \mathrm{SID}(i,j) & \text{if } |\mathrm{SID}(i,j)| \le P \\ \infty & \text{otherwise} \end{cases}
\tag{1}
$$

P is a pre-defined threshold, and node pairs with relative distance beyond P are ignored. We hypothesize that the precise relative distance is not useful beyond a certain range. The threshold both constrains the computational complexity within a constant range and saves memory space for storing the relative position embeddings. Figure 2 shows an example of generating the matrices A and S, in comparison with the position matrix generated from the linear relationship, which is used in standard Transformers. In the next section, we introduce how to use these two matrices to dynamically incorporate the relationship information through tree-structured attention.
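A direct, non-optimized construction of the two matrices can be sketched as follows (our own illustration rather than the released implementation; parent maps each node index to its parent index or None for the root, children lists each node's child indices, and P is the threshold of Eq. (1)):

import numpy as np

def relationship_matrices(parent, children, P):
    """Build the A (ancestor-descendent) and S (sibling) matrices of a tree.

    parent[i]   -> index of the parent of node i (None for the root)
    children[i] -> list of child indices of node i, left to right
    Entries outside the threshold P stay at +inf, matching Eq. (1).
    """
    n = len(parent)
    A = np.full((n, n), np.inf)
    S = np.full((n, n), np.inf)
    np.fill_diagonal(A, 0)

    # Ancestor-descendent: walk up from each node and record the path length.
    for i in range(n):
        j, dist = parent[i], 1
        while j is not None and dist <= P:
            A[j, i], A[i, j] = dist, -dist   # antisymmetric: A_ij = -A_ji
            j, dist = parent[j], dist + 1

    # Sibling: horizontal distance between children of the same parent.
    for kids in children:
        for a in range(len(kids)):
            for b in range(a + 1, len(kids)):
                d = b - a
                if d <= P:
                    S[kids[a], kids[b]], S[kids[b], kids[a]] = d, -d
    return A, S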
3.3 Tree-Structured Attention

Tree-structured attention is built on standard self-attention with relative position embeddings and disentangled attention. It replaces the relative position embeddings derived from the linear relationship with the two matrices derived from the tree structure.

Self-Attention. Standard self-attention transforms the input sequence x = (x_1, ..., x_n), where x_i in R^d is the embedding of n_i, into a sequence of output vectors o = (o_1, ..., o_n) with o_i in R^d.
The single-head self-attention [49] can be formulated as:

$$
\alpha_{ij} = \frac{Q(x_i)\,K(x_j)^\top}{\sqrt{d}}, \qquad
o_i = \sum_{j=1}^{n} \sigma(\alpha_{ij})\, V(x_j)
\tag{2}
$$

where Q, K : R^d -> R^m are the query and key functions respectively, V : R^d -> R^d is a value function, and σ is a scoring function (e.g., softmax).

... so that it will not add any additional parameter on top of the standard Transformer. h_A heads use δ_A(i, j) and the remaining h_S heads use δ_S(i, j); information from the two relationships is merged through multi-head attention. We then replace δ(i, j) in Eq. 4 with δ_R(i, j) from Formula 5, and apply a scaling factor of $\frac{1}{\sqrt{3d}}$ on $\tilde{\alpha}_{i,j}$ (because it has 3 terms). The final output vector is computed as in Eq. (6), where V^P represents the value projection matrix of relative distances and V^P_{R_ij} is the R_ij-th row of V^P.
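To make the restriction concrete, here is a minimal single-head sketch (our own simplification: it keeps only the content-to-content term of Eq. (2) and drops the relative position embeddings and the multi-head split) that excludes node pairs whose entry in the relationship matrix is infinite:

import numpy as np

def tree_structured_attention(X, Wq, Wk, Wv, delta):
    """Single-head attention over node embeddings X, restricted by delta.

    X:     [N, d] node embeddings
    delta: [N, N] relative-distance matrix (A or S); np.inf marks pairs
           that must not attend to each other.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(X.shape[1])
    scores = np.where(np.isinf(delta), -1e9, scores)   # exclude irrelevant nodes
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

In AST-Trans the same mask is realized implicitly by only computing scores for the finite entries of the relationship matrices, which is what yields the linear complexity discussed in Section 5.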
... element indices and the corresponding values. Let COO_row/COO_col denote the lists of row/column indices, and COO_val denote the list of values in the COO format of δ_R. We then use them as indices to gather the content queries and keys:

$$
Q_{row} = Q(x)[COO_{row}; :]; \quad K_{col} = K(x)[COO_{col}; :]
$$
$$
Q^P_{val} = Q^P[COO_{val}; :]; \quad K^P_{val} = K^P[COO_{val}; :]
$$

In this way, each row of the query content Q_row is paired with the corresponding row of the key content K_col. Then we can use the matrix dot production to compute the attention scores:

$$
\alpha_{coo} = Q_{row} \odot K_{col} + Q_{row} \odot K^P_{val} + Q^P_{val} \odot K_{col}
$$

where ⊙ indicates the dot production. α_coo is a vector corresponding to the non-zero values in α̃ (Eq. 4), with α̃[COO_row[i]; COO_col[i]] = α_coo[i]. The content-to-position and position-to-content terms can be computed the same as in Sparse, and the total number of gather operations in the attention computation is 4 times the number of non-zero elements in δ_R: 2 for gathering the content and 2 for gathering the position.

Gather with decomposed COO (GDC). To reduce the number of gather operations in GC, we add a matrix decomposition operation on top of it. First, we decompose δ_R by COO_val such that each sub-matrix δ_R^s contains only node pairs with the same relative distance s. An example is shown in Figure 3, where the original δ_R contains 3 distinct values and we decompose it into 3 sub-matrices accordingly. We transfer each sub-matrix δ_R^s into its COO format and use COO^s to denote the sub-matrix with val = s. For each sub-matrix COO^s, we gather the content embeddings of the nodes by:

$$
Q^s_{row} = Q(x)[COO^s_{row}; :], \quad K^s_{col} = K(x)[COO^s_{col}; :]
$$

where Q^s_row indicates the query content ordered by COO^s_row, and K^s_col represents the key content ordered by COO^s_col. The attention scores can then be computed as:

$$
\alpha^s_{coo} = (Q^s_{row} + Q^P_s) \odot (K^s_{col} + K^P_s) - (Q^P_s \odot K^P_s)
$$

where α^s_coo corresponds to the attention scores of the node pairs in δ_R^s. Note that α^s_coo is a vector of the same shape as COO^s_row. By padding all COO^s to the same length, the attention scores can be computed in parallel, and the final attention scores equal the sum of all α^s_coo:

$$
\alpha_{coo} = \sum_{s=1}^{2P+1} \alpha^s_{coo}
$$

Figure 3: Decomposing the relative distance matrix δ_R of the tree "abcd" with max relative distance P = 1.
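A compact sketch of this decomposition (our own illustration, not the full algorithm of Appendix A: QP and KP are hypothetical [2P+1, m] arrays of position queries/keys, relative distances are assumed to be pre-shifted integers with 0 marking excluded pairs, and the padding needed for parallel execution is omitted):

import numpy as np

def gdc_scores(Q, K, QP, KP, delta_R):
    """Gather-with-decomposed-COO attention scores.

    Q, K:    [N, m] content queries/keys
    QP, KP:  [2P+1, m] position queries/keys indexed by relative distance
             (negative distances use Python's negative indexing here)
    delta_R: [N, N] integer relative-distance matrix, 0 = no relationship
    Returns a dict mapping each distance s to (rows, cols, scores).
    """
    out = {}
    for s in np.unique(delta_R[delta_R != 0]):
        rows, cols = np.nonzero(delta_R == s)        # COO of the sub-matrix
        q = Q[rows] + QP[s]                          # position added once per s
        k = K[cols] + KP[s]
        scores = (q * k).sum(-1) - (QP[s] * KP[s]).sum()  # row-wise dot production
        out[s] = (rows, cols, scores)
    return out

Expanding (Q^s_row + Q^P_s) ⊙ (K^s_col + K^P_s) - Q^P_s ⊙ K^P_s recovers exactly the three terms of the GC score, which is why only one dot production per distance is needed.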
There are three benefits of this approach compared with GC:
• K^P and Q^P can be reused, as all pairs in Q^s_row and K^s_col share the same relative distance s. The position embeddings of s can be directly added to the content without gather operations.
• Only a quarter of the number of gather operations is needed (discussed in Section 5.3).
• Only one dot production is required, as the second term Q^P_s ⊙ K^P_s can be reused and only (Q^s_row + Q^P_s) ⊙ (K^s_col + K^P_s) needs to be calculated.
See Appendix A for the complete algorithm.

5 COMPLEXITY ANALYSIS

In this section, we discuss the best, worst and average complexity of the five implementations mentioned above. We use FLOPs (floating point operations) to measure the computational complexity. The considered operations include matrix multiplication, matrix dot production, addition and gather, which are the main operations involved in the attention computation. The FLOPs of these operations are listed below:

$$
\mathrm{FLOPs}(A + B) = N(m - 1); \quad \mathrm{FLOPs}(A[C; :]) = |C| \cdot m
$$
$$
\mathrm{FLOPs}(A \odot B) = Nm + N(m - 1)
\tag{7}
$$
$$
\mathrm{FLOPs}(A \times B) = N \cdot \mathrm{FLOPs}(A \odot B)
$$

where A and B are two matrices of shape [N, m], A[C; :] indicates gathering A with C as the index, and |C| is the number of elements in C.

We focus our analysis on attention heads using the ancestor-descendent relationship (A); similar ideas apply straightforwardly to the sibling relationship (S). As the complexity is related to the number of non-zero elements in δ_A (denoted by |δ_A > 0|), we first analyze the range of |δ_A > 0| and then present the complexity of each implementation.

5.1 Range of |δ_A > 0|

Theorem 5.1. For any directed tree T, let E(i) represent the number of paths in T with length i and L represent the length of the longest path in T. We have:

$$
E(1) > E(2) > \cdots > E(L)
$$

Proof. Assume there are N nodes in the tree and the root node is at level 1. Define N_j as the number of nodes at level j. For each node at level j, if j - i > 0, there exists exactly one path of length i ending at this node; otherwise no such path exists. Hence, $E(i) = N - \sum_{j=1}^{i} N_j$ and N_j > 0. Therefore we must have E(i) > E(i + 1).

Theorem 5.2. Every tree with N nodes has exactly N - 1 edges.

Proof. Imagine starting with N isolated nodes and adding edges one at a time. Adding one edge will either (1) connect two components together, or (2) close a circuit. Since a tree is fully connected and has no circuit, we must add exactly N - 1 edges.

Least upper and greatest lower bound. Let E(0) = N denote the number of nodes in a tree. We have |δ_A > 0| = E(0) + 2(E(1) + E(2) + ... + E(P)), since we consider both positive and negative distances in δ_A. Based on the above two theorems, we have:

$$
E(i) \le E(i-1) - 1 \le \cdots \le E(0) - i = N - i
$$
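This linear behavior can also be checked empirically with a short script (our own illustration, not the measurement behind Figure 4; random trees here attach each new node to a uniformly chosen earlier node). Since |δ_A > 0| = N + 2 x (number of ancestor-descendent pairs within distance P), the ratio |δ_A > 0| / N stays bounded by 2P + 1 as N grows:

import random

def count_nonzero_A(parent, P):
    """|delta_A > 0| = N + 2 * (ancestor-descendent pairs within distance P)."""
    n = len(parent)
    pairs = 0
    for i in range(n):
        j, dist = parent[i], 1
        while j is not None and dist <= P:
            pairs += 1
            j, dist = parent[j], dist + 1
    return n + 2 * pairs

for n in (100, 200, 400, 800):
    parent = [None] + [random.randrange(i) for i in range(1, n)]
    print(n, count_nonzero_A(parent, P=5) / n)   # ratio stays roughly constant in N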
Figure 4: |δ_A > 0| in the case of random trees; the abscissa is the max relative distance P and the ordinate is the number of non-zero elements in δ_A in units of O(N). The coefficient decreases with growing P.

Figure 5: Theoretical complexity with P = 5, m = 32. loop has the lowest complexity but cannot be parallelized in practice.
Table 3: Comparison of AST-Trans with the baseline methods, categorized based on the input type. * means implemented by ourselves.

Methods | Input | BLEU (Java, %) | METEOR (Java, %) | ROUGE-L (Java, %) | BLEU (Python, %) | METEOR (Python, %) | ROUGE-L (Python, %)
CODE-NN [20] | Code | 27.6 | 12.61 | 41.10 | 17.36 | 09.29 | 37.81
API+CODE [19] | Code | 41.31 | 23.73 | 52.25 | 15.36 | 08.57 | 33.65
Dual Model [53] | Code | 42.39 | 25.77 | 53.61 | 21.80 | 11.14 | 39.45
BaseTrans* [1] | Code | 44.58 | 29.12 | 53.63 | 25.77 | 16.33 | 38.95
Code-Transformer* [57] | Code | 45.74 | 29.65 | 54.96 | 30.93 | 18.42 | 43.67
Tree2Seq [11] | AST(Tree) | 37.88 | 22.55 | 51.50 | 20.07 | 08.96 | 35.64
RL+Hybrid2Seq [51] | AST(Tree) | 38.22 | 22.75 | 51.91 | 19.28 | 09.75 | 39.34
GCN* [22] | AST(Tree) | 43.94 | 28.92 | 55.45 | 32.31 | 19.54 | 39.67
GAT* [50] | AST(Tree) | 44.63 | 29.19 | 55.84 | 32.16 | 19.30 | 39.12
Graph-Transformer* [40] | AST(Tree) | 44.68 | 29.29 | 54.98 | 32.55 | 19.58 | 39.66
Code2Seq* [4] | AST(PD) | 24.42 | 15.35 | 33.95 | 17.54 | 08.49 | 20.93
Code2Seq(Transformer)* | AST(PD) | 35.08 | 21.69 | 42.77 | 29.79 | 16.73 | 40.59
DeepCom [18] | AST(SBT) | 39.75 | 23.06 | 52.67 | 20.78 | 09.98 | 37.35
Transformer(SBT)* | AST(SBT) | 43.37 | 28.36 | 52.37 | 31.33 | 19.02 | 44.09
AST-Trans(SBT)* | AST(SBT) | 44.15 | 29.58 | 54.73 | 32.86 | 19.89 | 45.92
Transformer(POT)* | AST(POT) | 39.62 | 26.30 | 50.63 | 31.86 | 19.63 | 44.73
AST-Trans | AST(POT) | 48.29 | 30.94 | 55.85 | 34.72 | 20.71 | 47.77
... it with graph neural network (GNN) models. We consider three kinds of GNN models: GCN [22], GAT [50] and Graph-Transformer [40]. The edges fed to the GNNs include the ancestor-descendant and sibling edges, distinguished by edge attributes.

3: AST(PD). Models with the AST linearized by path decomposition as input. Path representations need to be encoded from the nodes, and the whole AST representation is then encoded from the path representations. Code2Seq [4] is the first approach using PD, and it used two LSTM models to encode the hierarchical network. For fairness of comparison, we also design a new baseline Code2Seq(Transformer) by replacing these two LSTM models with the Transformer.

4: AST(SBT). Models with the AST linearized by structure-based traversal as input. DeepCom [18] is the first work that uses AST (SBT) as input, which it encodes with an LSTM. We design a new baseline Transformer(SBT) that encodes AST (SBT) with the Transformer. AST-Trans(SBT) is our proposed model with SBT input and the relationship matrices.

5: AST(POT). Models with the AST linearized by pre-order traversal as input. Transformer(POT) is the standard Transformer architecture with AST (POT) as input, and AST-Trans is our proposed model with tree-structured attention.

All Transformer-based models use the relative position embeddings with disentangled attention mentioned in Section 3.3 and have the same number of parameters. The same hyper-parameters are used throughout for a fully fair comparison.

Table 4: Ablation study on AST-Trans with/without A and S.

Model | Dataset | BLEU (%) | METEOR (%) | ROUGE-L (%)
AST-Trans w/o A | Java | 47.74 | 30.21 | 54.56
AST-Trans w/o S | Java | 48.07 | 30.62 | 55.29
AST-Trans | Java | 48.29 | 30.94 | 55.85
AST-Trans w/o A | Python | 34.35 | 20.15 | 46.62
AST-Trans w/o S | Python | 34.32 | 20.28 | 46.87
AST-Trans | Python | 34.72 | 20.71 | 47.77

Table 5: Ablation study on h_A and h_S on the Java dataset.

h_A | h_S | BLEU (%) | METEOR (%) | ROUGE-L (%)
0 | 8 | 47.74 | 30.21 | 54.56
1 | 7 | 48.29 | 30.94 | 55.85
2 | 6 | 48.28 | 30.94 | 55.64
3 | 5 | 48.25 | 30.92 | 55.66
4 | 4 | 48.23 | 30.96 | 55.68
5 | 3 | 48.11 | 30.93 | 55.46
6 | 2 | 48.10 | 30.74 | 55.22
7 | 1 | 48.24 | 30.91 | 55.57
8 | 0 | 48.07 | 30.62 | 55.29

6.3 Main Results

The main results of AST-Trans and the baselines are presented in Table 3⁵. AST-Trans outperforms all the baselines on all three metrics. Specifically, it outperforms the best baseline by 3.61 and 2.17 in BLEU, 1.65 and 1.08 in METEOR, and 0.87 and 3.04 in ROUGE-L on the Java and Python datasets respectively.

Code vs AST (Tree) vs AST (linearized). Apart from AST-Trans, using GNNs to encode the AST (Tree) achieves the best results on both datasets. The reason is that the AST carries both structural and semantic information, while the other two input types each lose part of the structural information. All three variants of GNNs achieve similar results and outperform the Tree-LSTM in encoding the AST (Tree). Compared with taking the linearized AST as input, models using only the code perform better on the Java dataset but worse on the Python dataset. This could be related to the code length: as code and the corresponding ASTs in Python are relatively shorter, encoding ASTs is more effective than on the Java dataset. Therefore, models using linearized ASTs, with the help of the additional structural information, are able to outperform models using only the code.

AST(PD) vs AST(SBT) vs AST(POT). Among the three linearization methods, when using the Transformer encoder/decoder, AST (SBT) performs best on the Java dataset and AST (POT) performs best on the Python dataset. AST(SBT) and AST(POT) both have their own advantages: AST(SBT) maintains more structural information than AST(POT), while AST(POT) has the shortest length among the three linearization methods. Using AST (PD) as input leads to poor performance on both datasets, for two main reasons. On the one hand, the AST(PD) method was first proposed for method name completion. Method names are much shorter than code summaries and do not include many details. PD linearization extracts features from paths, which aggregates high-level characteristics but ignores the detailed information in the nodes. However, code summarization requires more detailed information about the code, such as the type of the return value, which is stored in the leaf nodes. On the other hand, Code2Seq(Transformer) uses a hierarchical network and has many more trained parameters; it is thereby harder to converge than Transformer(SBT) and Transformer(POT).

Impact of the relationship matrix R. We compare the performance of the three kinds of inputs with and without the relationship matrix R: Code-Transformer vs BaseTrans, AST-Trans (SBT) vs Transformer (SBT), and AST-Trans (POT) vs Transformer (POT). The results show that adding R improves the performance for all these inputs and that AST-Trans (POT) performs best. This is because Code-Transformer ignores the non-leaf node information, and AST-Trans (SBT) stores duplicate information, resulting in an overly long sequence. AST-Trans (POT) maintains a short sequence length without losing necessary structural or semantic information.

AST-Trans vs GNN. AST-Trans outperforms the GNNs, the best-performing baselines on both datasets. With the help of the relationship matrices, AST-Trans includes additional relative distance information: nodes can perceive information from their p-distance neighbors at each layer, whereas in a GNN each node needs p hops to propagate information from these neighbors. In addition, AST-Trans uses the multi-head mechanism to compute different relationships in different heads, while in GNNs all relationships, distinguished by edge attributes, are calculated together. AST-Trans also uses extra feed-forward layers and residual connections in the encoder, which could help improve the model's generalization.

⁵ The results of BaseTrans [1] on the Python dataset are lower than reported in the original paper (-6.75 BLEU, -3.44 METEOR and -7.78 ROUGE-L). We then set the max relative distance P to 16 (the same as in the original paper) and get 27.27 (-5.25) BLEU, 15.90 (-3.87) METEOR and 38.58 (-8.15) ROUGE-L. This reduction may be because we additionally segment multi-words in comments.
6.6 Visualization and Qualitative Analysis

Visualization. We further visualize the relative position representations of the ancestor-descendant (A) and sibling (S) relationships in Fig. 8. As can be seen, the variance of the relative position embeddings in S is much larger than in A. This implies that our model is not sensitive to the relative distance between ancestor and descendant nodes, as the embeddings are almost the same regardless of the positions. In contrast, the variance for sibling nodes is relatively large, and the model can distinguish sibling nodes with different relative distances. In addition, the relative embeddings in A are demarcated between the upper and lower parts, suggesting a clear distinction between ancestor and descendant nodes. This shows that our model pays more attention to direction than to distance in A. It is likely that the exact distance between sibling nodes is more important than that between ancestor-descendant nodes in ASTs.

Qualitative analysis. We provide a couple of examples for qualitative analysis in Table 8. It can be observed that AST-Trans generates the closest summary to the reference, and that the lack of A or S hurts the quality of the summarization. In the first case, the key information is the connection between the sibling nodes method call ("addAll") and parameter ("actions"). Both AST-Trans and AST-Trans w/o A generate the summary as a batch add operation, while AST-Trans w/o S misunderstands it as "adds an action". On the contrary, the meaning of the third case is to get the job by its tag first and then delete it. The order of execution is controlled by the ancestor-descendent relationship (the method call "get" is a child node of "delete"), and AST-Trans w/o A simply ignores the "get" operation. The summaries of AST-Trans w/o A and w/o S are both correct in the second case; its statements are relatively simple, and ignoring their order does not affect the comprehension of the function.

Table 8 (excerpt): the third qualitative example.

def job_delete_by_tag(tag):
    Job.objects.get(tag=tag).delete()
    return (job_get_by_tag(tag) is None)

AST-Trans w/o S: delete a job and return tag
AST-Trans w/o A: delete a job objects
AST-Trans: delete a job based on its tag
Human Written: deletes a job entry based on its tag

7 THREATS TO VALIDITY

There are three main threats to the validity of our evaluation. Firstly, many public datasets have been proposed for code summarization. We select two widely used ones to evaluate the proposed AST-Trans, but they may not be representative of other programming languages. Secondly, to ensure as fair a comparison as possible, we build the baselines on top of the same Transformer architecture. The architecture and hyperparameter choices might be sub-optimal for certain approaches⁶. Finally, there will be a certain gap between the automatic evaluation and a manual evaluation of the summarization results. We select three different automatic evaluation metrics to avoid bias as much as possible.

⁶ Nevertheless, AST-Trans performs best among all reported results on both datasets.

8 RELATED WORKS

Code Summarization. Most approaches to code summarization frame the problem as a sequence generation task and use an encoder-decoder architecture. The only difference from traditional machine translation is that programming languages are unambiguous and follow rigid grammar rules. Most approaches either treat the source code as natural language (i.e., a sequence of tokens without specified structure), or utilize its structural information with the help of ASTs or other parsed forms. To encode the code sequence, there exist many encoder architectures such as CNNs [3], RNNs [20, 55] and the Transformer [1]. To leverage the tree-structured AST, tree-based models such as the Recursive NN [26], Tree-LSTM [41, 51] and Tree-Transformer [15, 52] are used to encode the AST directly. As a tree is a special kind of graph, graph-based approaches [2, 12, 23] can also be used to encode ASTs. Some works also combine the code token sequence with the AST and observe improvement [23–25]. Our approach only needs the linearized AST
and can be built upon the Transformer architecture. More importantly, it restricts the attention range, which makes it possible to encode very long AST sequences.

Tree-based Neural Networks. Existing tree-based neural networks can be grouped into two categories depending on their inputs: (1) models that directly take the tree as input [15, 31, 34, 47]; these models are strongly coupled with the tree structure, and their computation has to be performed along with the tree traversal. Since trees generally have different shapes, parallelizing the training of these models is non-trivial. (2) Models that take sequences extracted from the tree as input, such as sampled paths in the tree [4, 21], the traversal sequence with tree positional embeddings [42], or the structure-based traversal (SBT) sequence [18]. Taking sampled paths as input introduces a certain degree of randomness and instability, and the tree positional embedding method ignores the concept of paths in the tree (all nodes, even if not related, participate in the calculation together). Our method improves on these two approaches: it guarantees that each node exchanges messages on, and only on, the paths containing it.

9 CONCLUSION

In this paper, we present AST-Trans, which can encode ASTs effectively for code summarization. In AST-Trans, each node only pays attention to the nodes that share the ancestor-descendent or sibling relationship with it. This brings two benefits: (1) the model is given an inductive bias and will not get lost in the overlong AST sequence, and (2) the computational complexity is reduced from quadratic to linear. The latter makes it possible to encode long code sequences, e.g., a whole file, which is prohibitively expensive for standard Transformers. We conduct comprehensive experiments showing that AST-Trans achieves state-of-the-art results on two popular benchmarks while significantly reducing the computational cost.

We believe the basic idea of AST-Trans can also be applied to other structured data such as data dependence and control flow graphs. The code is made publicly available to benefit the relevant research. In future work, we plan to improve AST-Trans by incorporating more features of the code snippet, such as the API sequence and node types, into the self-attention mechanism.

10 ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (61802167, 61802095), the Natural Science Foundation of Jiangsu Province (No. BK20201250), the Cooperation Fund of the Huawei-NJU Creative Laboratory for the Next Programming, and NSF award 2034508. We thank Alibaba Cloud for its highly efficient AI computing service from the EFlops cluster. We also thank the reviewers for their helpful comments. Chuanyi Li and Jidong Ge are the corresponding authors.

REFERENCES
[1] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A Transformer-based Approach for Source Code Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Association for Computational Linguistics, 4998–5007. https://doi.org/10.18653/v1/2020.acl-main.449
[2] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=BJOFETxR-
[3] Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 (JMLR Workshop and Conference Proceedings, Vol. 48), Maria-Florina Balcan and Kilian Q. Weinberger (Eds.). JMLR.org, 2091–2100. http://proceedings.mlr.press/v48/allamanis16.html
[4] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating Sequences from Structured Representations of Code. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=H1gKYo09tX
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1409.0473
[6] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare R. Voss (Eds.). Association for Computational Linguistics, 65–72. https://www.aclweb.org/anthology/W05-0909/
[7] Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. CoRR abs/2004.05150 (2020). arXiv:2004.05150 https://arxiv.org/abs/2004.05150
[8] Ernie Chang, Xiaoyu Shen, Hui-Syuan Yeh, and Vera Demberg. 2021. On Training Instance Selection for Few-Shot Neural Text Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 8–13.
[9] Jianbo Dong, Zheng Cao, Tao Zhang, Jianxi Ye, Shaochuang Wang, Fei Feng, Li Zhao, Xiaoyong Liu, Liuyihan Song, Liwei Peng, et al. 2020. EFLOPS: Algorithm and System Co-design for a High Performance Distributed Training Platform. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 610–622.
[10] Jianbo Dong, Shaochuang Wang, Fei Feng, Zheng Cao, Heng Pan, Lingbo Tang, Pengcheng Li, Hao Li, Qianyuan Ran, Yiqun Guo, et al. 2021. ACCL: Architecting Highly Scalable Distributed Training Systems with Highly-Efficient Collective Communication Library. IEEE Micro (2021).
[11] Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-Sequence Attentional Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics. https://doi.org/10.18653/v1/p16-1078
[12] Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Structured Neural Summarization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=H1ersoRqtm
[13] Sonia Haiduc, Jairo Aponte, and Andrian Marcus. 2010. Supporting Program Comprehension with Source Code Summarization. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2, ICSE 2010, Cape Town, South Africa, 1-8 May 2010, Jeff Kramer, Judith Bishop, Premkumar T. Devanbu, and Sebastián Uchitel (Eds.). ACM, 223–226. https://doi.org/10.1145/1810295.1810335
[14] Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the Use of Automated Text Summarization Techniques for Summarizing Source Code. In 17th Working Conference on Reverse Engineering, WCRE 2010, 13-16 October 2010, Beverly, MA, USA, Giuliano Antoniol, Martin Pinzger, and Elliot J. Chikofsky (Eds.). IEEE Computer Society, 35–44. https://doi.org/10.1109/WCRE.2010.13
[15] Jacob Harer, Christopher P. Reale, and Peter Chin. 2019. Tree-Transformer: A Transformer-Based Method for Correction of Tree-Structured Data. CoRR abs/1908.00449 (2019). arXiv:1908.00449 http://arxiv.org/abs/1908.00449
[16] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/forum?id=XPZIaotutsD
[17] Dan Hendrycks and Kevin Gimpel. 2016. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. CoRR abs/1606.08415 (2016). arXiv:1606.08415 http://arxiv.org/abs/1606.08415
[18] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep Code Comment Generation. In Proceedings of the 26th Conference on Program Comprehension, ICPC 2018, Gothenburg, Sweden, May 27-28, 2018, Foutse Khomh, Chanchal K. Roy, and Janet Siegmund (Eds.). ACM, 200–210. https://doi.org/10.1145/3196321.3196334
[19] Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018. Summarizing Source Code with Transferred API Knowledge. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, Jérôme Lang (Ed.). ijcai.org, 2269–2275. https://doi.org/10.24963/ijcai.2018/314
[20] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics. https://doi.org/10.18653/v1/p16-1195
[21] Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. 2021. Code Prediction by Feeding Trees to Transformers. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 150–162. https://doi.org/10.1109/ICSE43902.2021.00026
[22] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=SJU4ayYgl
[23] Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved Code Summarization via a Graph Neural Network. In ICPC '20: 28th International Conference on Program Comprehension, Seoul, Republic of Korea, July 13-15, 2020. ACM, 184–195. https://doi.org/10.1145/3387904.3389268
[24] Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A Neural Model for Generating Natural Language Summaries of Program Subroutines. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, Joanne M. Atlee, Tevfik Bultan, and Jon Whittle (Eds.). IEEE / ACM, 795–806. https://doi.org/10.1109/ICSE.2019.00087
[25] Boao Li, Meng Yan, Xin Xia, Xing Hu, Ge Li, and David Lo. 2020. DeepCommenter: A Deep Code Comment Generation Tool with Hybrid Lexical and Syntactical Information. In ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, Prem Devanbu, Myra B. Cohen, and Thomas Zimmermann (Eds.). ACM, 1571–1575. https://doi.org/10.1145/3368089.3417926
[26] Yuding Liang and Kenny Qili Zhu. 2018. Automatic Generation of Text Descriptive Comments for Code Blocks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, Sheila A. McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press, 5229–5236. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16492
[27] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://www.aclweb.org/anthology/W04-1013
[28] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=Bkg6RiCqY7
[29] Paul W. McBurney and Collin McMillan. 2016. Automatic Source Code Summarization of Context for Java Methods. IEEE Trans. Software Eng. 42, 2 (2016), 103–119. https://doi.org/10.1109/TSE.2015.2465386
[30] Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori L. Pollock, and K. Vijay-Shanker. 2013. Automatic Generation of Natural Language Summaries for Java Classes. In IEEE 21st International Conference on Program Comprehension, ICPC 2013, San Francisco, CA, USA, 20-21 May, 2013. IEEE Computer Society, 23–32. https://doi.org/10.1109/ICPC.2013.6613830
[31] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, Dale Schuurmans and Michael P. Wellman (Eds.). AAAI Press, 1287–1293. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11775
[32] Genevieve B. Orr and Klaus-Robert Müller (Eds.). 1998. Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, Vol. 1524. Springer. https://doi.org/10.1007/3-540-49430-8
[33] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. ACL, 311–318. https://www.aclweb.org/anthology/P02-1040/
[34] Jordan B. Pollack. 1990. Recursive Distributed Representations. Artif. Intell. 46, 1-2 (1990), 77–105. https://doi.org/10.1016/0004-3702(90)90005-K
[35] Heinz Prüfer. 1918. Neuer Beweis eines Satzes über Permutationen. Arch. Math. Phys 27, 1918 (1918), 742–744.
[36] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), Marilyn A. Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics, 464–468. https://doi.org/10.18653/v1/n18-2074
[37] Xiaoyu Shen, Youssef Oualil, Clayton Greenberg, Mittul Singh, and Dietrich Klakow. 2017. Estimation of Gap Between Current Language Models and Human Performance. Proc. Interspeech 2017 (2017), 553–557.
[38] Xiaoyu Shen, Jun Suzuki, Kentaro Inui, Hui Su, Dietrich Klakow, and Satoshi Sekine. 2019. Select and Attend: Towards Controllable Content Selection in Text Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 579–590.
[39] Xiaoyu Shen, Yang Zhao, Hui Su, and Dietrich Klakow. 2019. Improving Latent Alignment in Text Summarization by Generalizing the Pointer Generator. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3753–3764.
[40] Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjing Wang, and Yu Sun. 2021. Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, Zhi-Hua Zhou (Ed.). ijcai.org, 1548–1554. https://doi.org/10.24963/ijcai.2021/214
[41] Yusuke Shido, Yasuaki Kobayashi, Akihiro Yamamoto, Atsushi Miyamoto, and Tadayuki Matsumura. 2019. Automatic Source Code Summarization with Extended Tree-LSTM. In International Joint Conference on Neural Networks, IJCNN 2019, Budapest, Hungary, July 14-19, 2019. IEEE, 1–8. https://doi.org/10.1109/IJCNN.2019.8851751
[42] Vighnesh Leonardo Shiv and Chris Quirk. 2019. Novel Positional Encodings to Enable Tree-based Transformers. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.). 12058–12068.
[43] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori L. Pollock, and K. Vijay-Shanker. 2010. Towards Automatically Generating Summary Comments for Java Methods. In ASE 2010, 25th IEEE/ACM International Conference on Automated Software Engineering, Antwerp, Belgium, September 20-24, 2010, Charles Pecheur, Jamie Andrews, and Elisabetta Di Nitto (Eds.). ACM, 43–52. https://doi.org/10.1145/1858996.1859006
[44] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15, 1 (2014), 1929–1958. http://dl.acm.org/citation.cfm?id=2670313
[45] Hui Su, Xiaoyu Shen, Zhou Xiao, Zheng Zhang, Ernie Chang, Cheng Zhang, Cheng Niu, and Jie Zhou. 2020. MovieChats: Chat like Humans in a Closed Domain. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6605–6619.
[46] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2818–2826. https://doi.org/10.1109/CVPR.2016.308
[47] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers. The Association for Computer Linguistics, 1556–1566. https://doi.org/10.3115/v1/p15-1150
[48] Ze Tang, Chuanyi Li, Jidong Ge, Xiaoyu Shen, Zheling Zhu, and Bin Luo. 2021. AST-Transformer: Encoding Abstract Syntax Trees Efficiently for Code Summarization. arXiv preprint arXiv:2112.01184 (2021).
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need
[50] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=rJXMpikCZ
[51] Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving Automatic Source Code Summarization via Deep Reinforcement Learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France, September 3-7, 2018, Marianne Huchard, Christian Kästner, and Gordon Fraser (Eds.). ACM,