0% found this document useful (0 votes)
2 views

AST-Trans- Code Summarization with Efficient Tree-Structured

AST-Trans- Code Summarization with Efficient Tree-Structured

Uploaded by

zhuyang9158
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

AST-Trans- Code Summarization with Efficient Tree-Structured

AST-Trans- Code Summarization with Efficient Tree-Structured

Uploaded by

zhuyang9158
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)

AST-Trans: Code Summarization with Efficient Tree-Structured


Attention
Ze Tang Xiaoyu Shen∗ Chuanyi Li, Jidong Ge
State Key Laboratory for Novel Alexa AI State Key Laboratory for Novel
Software Technology Amazon Software Technology
Nanjing University Berlin, Germany Nanjing University
Nanjing, China [email protected] Nanjing, China
[email protected] lcy,[email protected]

Liguo Huang Zhelin Zhu, Bin Luo


Department of Computer Science State Key Laboratory for Novel
Southern Methodist University Software Technology
Dallas, Texas, USA Nanjing University
[email protected] Nanjing, China
zzl,[email protected]
ABSTRACT CCS CONCEPTS
Code summarization aims to generate brief natural language de- • Software and its engineering → Documentation; • Comput-
scriptions for source codes. The state-of-the-art approaches follow ing methodologies → Natural language generation.
a transformer-based encoder-decoder architecture. As the source
code is highly structured and follows strict grammars, its Abstract KEYWORDS
Syntax Tree (AST) is widely used for encoding structural infor- tree-based neural network, source code summarization
mation. However, ASTs are much longer than the corresponding
source code. Existing approaches ignore the size constraint and ACM Reference Format:
simply feed the whole linearized AST into the encoders. We argue Ze Tang, Xiaoyu Shen, Chuanyi Li, Jidong Ge, Liguo Huang, and Zhelin
that such a simple process makes it difficult to extract the truly use- Zhu, Bin Luo. 2022. AST-Trans: Code Summarization with Efficient Tree-
Structured Attention. In 44th International Conference on Software Engineer-
ful dependency relations from the overlong input sequence. It also
ing (ICSE ’22), May 21–29, 2022, Pittsburgh, PA, USA. ACM, New York, NY,
incurs significant computational overhead since each node needs
USA, 13 pages. https://doi.org/10.1145/3510003.3510224
to apply self-attention to all other nodes in the AST. To encode
the AST more effectively and efficiently, we propose AST-Trans
in this paper which exploits two types of node relationships in 1 INTRODUCTION
the AST: ancestor-descendant and sibling relationships. It applies The summary of source code is a brief natural language description
the tree-structured attention to dynamically allocate weights for explaining the purpose of the code [29]. The code to be summarized
relevant nodes and exclude irrelevant nodes based on these two can be with different units. In this work, we focus on summarizing
relationships. We further propose an efficient implementation to the subroutines or defined methods in a program.
support fast parallel computation for tree-structure attention. On Previous studies have shown that such a short description can
the two code summarization datasets, experimental results show assist program developers to quickly digest the code without travers-
that AST-Trans significantly outperforms the state-of-the-arts while ing over it themselves [43]. Nonetheless, maintaining high-quality
being times more efficient than standard transformers 1 . code summaries requires expensive manual labor in reality. In many
projects, these summaries are often mismatched, missing or out-
dated which slow down the developing progress [18]. Automatic
∗ Work done before joining. code summarization can greatly save developers’ time by avoiding
1 All
the codes and data are available at https://github.com/zetang94/ICSE2022_AST_ writing such summaries manually for every single code snippet.
Trans.git
The traditional methods utilized handcrafted rules like Software
Word-Usage Model (SWUM) [43] or stereotypes [30] to synthe-
size the code summaries. However, when identifiers or methods
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed are poorly named, they cannot extract accurate keywords to pro-
for profit or commercial advantage and that copies bear this notice and the full citation duce good summaries. Some used Information Retrieval (IR) tech-
on the first page. Copyrights for components of this work owned by others than ACM niques [13, 14] to mine summaries from similar existing code banks
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a which, unfortunately, cannot generalize to unseen code snippets
fee. Request permissions from [email protected]. with different functions.
ICSE ’22, May 21–29, 2022, Pittsburgh, PA, USA Recently, with the development of open source platforms such as
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9221-1/22/05. . . $15.00 Github, more and more data for code summarization can be easily
https://doi.org/10.1145/3510003.3510224 extracted from online resources. Data-driven strategies based on

150

Authorized licensed use limited to: China University of Petroleum. Downloaded on April 11,2023 at 12:23:20 UTC from IEEE Xplore. Restrictions apply.
ICSE ’22, May 21–29, 2022, Pittsburgh, PA, USA Ze Tang, Xiaoyu Shen, Chuanyi Li, Jidong Ge, Liguo Huang, and Zhelin Zhu, Bin Luo

   return 0 if x<0, else return x by the two relationship matrices to better model the dependency.
   itself.

We further describe several implementations of the proposed AST-
code summary
Trans and have a comprehensive analysis of their computational
     Return AST complexity. In short, the contributions of this paper are as below:
 
IfExp • We propose AST-Trans that can efficiently encode long AST
Compare body orelse sequences with linear complexity, in contrast with the qua-
dratic complexity of the standard Transformer.
NameLoad(x) Lt constant(0) constant(0) NameLoad(x)
• We perform a comprehensive analysis, with both theoretical
and empirical evidences, on the computational complexity
Figure 1: Example of code-AST-summary triples. We mainly need of different implementations.
to understand the ancestor-descendent and sibling relationships in • We validate our proposed model on two datasets of Java and
the AST to generate a summary. Python. Experimental results show that AST-Trans outper-
forms the state-of-the-arts by a substantial margin.
• We compare representative methods for AST encoding and
neural networks start to raise more and more attention [20, 37– discuss their pros and cons.
39, 56]. Current state-of-the-arts all follow the Transformer-based
Paper Organization The remainder of this paper is organized
encoder-decoder architecture [5, 8, 45, 48, 49] and can be trained
as follows. Section 2 presents background knowledge on the Trans-
end-to-end with code-summary pairs. Since the source code is
former and AST. Section 3 elaborates on the details of AST-Trans,
highly structured and follows strict programming language gram-
section 4 presents its different implementation and the complexity
mars, a common practice is to also leverage the Abstract Syntax
is analyzed in section 5. Section 6 explains the experimental setup
Tree (AST) to help the encoder digest the structured information.
and analyzes the results. Section 7 discusses threats to validity. Sec-
The AST is usually linearized by different algorithms like pre-order
tion 8 surveys the related work. Finally, section 9 concludes the
traversal [21], structure-based traversal (SBT) [18] and path decom-
paper and points out future research directions.
position [4], then fed into the encoder. Several works also proposed
architectures specific for tree encoding like tree-LSTM [11, 51].
However, the linearized ASTs, as containing additional struc- 2 BACKGROUND
tured information, are much longer than their corresponding source Transformer. The Transformer architecture was initially proposed
code sequence. Some linearization algorithms can further increase for neural machine translation [49]. It consists of multi-head stacked
the length. For example, linearizing with SBT usually makes the encoder and decoder layers. In each encoder stack, the inputs first
size times longer. This makes the model extremely difficult to accu- flow through a self-attention sublayer, and then are fed into a
rately detect useful dependency relations from the overlong input position-wise feed-forward network followed by a layer normaliza-
sequence 2 . Moreover, it brings significant computational overhead, tion. The decoder has a set of the cross-attention layers to help the
especially for state-of-the-art Transformer-based models where decoder focus on relevant parts of the input sequence. The Trans-
the number of self-attention operations grows quadratically with former architecture removes the recurrence mechanism in favor of
the sequence length. Encoding ASTs with tree-based models like the self-attention. As each word in a sentence simultaneously flows
tree-LSTM will incur extra complexity because it needs to traverse through the encoder and decoder stack, the model itself does not
the whole tree to obtain the state of each node. have any sense of the word order. Therefore, a position embedding
In this work, we assume that the state of a node in the AST is is added to each word embedding to inform the order information.
affected most by its (1) ancestor-descendent nodes, which represent Abstract Syntax Tree (AST). An Abstract Syntax Tree (AST)
the hierarchical relationship across different blocks, and (2) sibling uniquely represents a source code snippet in a given language
nodes, which represent the temporal relationship within one block. and grammar [4]. The leaves of the tree are terminals, usually re-
We show an example of code summarization in Figure 1. As can be ferring to variables, types and method names. The non-leaf nodes
seen, we need the ancestor-descendent relationship to understand are non-terminals and represent a restricted set of structures in the
the high-level procedure, and the sibling relationship to understand programming language, e.g., loops, expressions, and variable decla-
the low-level details within a block. Capturing these two relation- rations. For example, in Figure 1, variables (such as NameLoad(x))
ships are enough for producing the summary and modelling the are represented as terminals of AST. Syntactic structures (such as
full attention among all nodes is unnecessary. Compare) are represented as non-terminals. Since the variable and
Based on this intuition, we propose AST-Trans, a simple variant method names can be rather freely defined, directly processing the
of the Transformer model to efficiently handle the tree-structured source code can be challenging. Its corresponding AST, due to its
AST. AST-Trans exploits ancestor-descendant and sibling relation- strict structure, often serves as substitute when encoding the source
ship matrices to represent the tree-structure, and uses these ma- code.
trices to dynamically exclude irrelevant nodes in different self-
attention layers. The absolute position embedding from the original 3 AST-TRANS
Transformer is replaced with relative position embeddings defined
This section details our proposed AST-Trans. For an AST, it will
2 Indeed,encoding the overlong AST with SBT even underperforms directly encoding be firstly linearized into a sequence. Then the ancestor-descendent
the source code when using Transformer with relative position embeddings [1]. and sibling relationships among its nodes will be denoted through

151

Authorized licensed use limited to: China University of Petroleum. Downloaded on April 11,2023 at 12:23:20 UTC from IEEE Xplore. Restrictions apply.
AST-Trans: Code Summarization with Efficient Tree-Structured Attention ICSE ’22, May 21–29, 2022, Pittsburgh, PA, USA

Table 1: Linearized AST of the tree in Fig 1 with POT,SBT and PD.          
 


Methods Linearized AST sequence
Return IfExp Compare NameLoad(x) Lt constant(0) body constant(0) orelse
POT
NameLoad(x)  
  
( Return ( IfExp ( Compare ( constant(0) ) constant(0) ( Lt ) Lt ( NameLoad(x)
SBT ) NameLoad(x) ) Compare ( body ( constant(0) ) constant(0) ) body ( orelse              Ğ Ğ
( NameLoad(x) ) NameLoad(x) ) orelse ) IfExp ) Return         Ğ Ğ  
    Ğ  Ğ  
Path1: Path1: Lt Compare constant(0)
Path2: NameLoad(x) Compare constant(0)
PD
Path3: Path3: constant(0) Compare IfExp body constant(0)
... Figure 2: Example of generating position matrices for ancestor-
descendent (A) and sibling relationship (S). Position matrix gener-
ated from the linear relationship is used in standard Transformers.
two specific matrices. Based on the matrices, we replace the stan-
dard self-attention with tree-structured attention to better model
these two relationships. Irrelevant nodes are dynamically ruled
out to reduce computational cost. We will first introduce different Specifically, two nodes have the ancestor-descendant relationship if
linearization methods (section 3.1), then explain the construction there exists a directed path from root node that can traverse through
of two relationship matrices (section 3.2), and finally present the them. Two nodes have the sibling relationship if they share the
tree-structure attention to utilize the matrices(section 3.3). same parent node.
We use two position matrices 𝐴𝑁 ×𝑁 and 𝑆 𝑁 ×𝑁 to represent
3.1 AST Linearization the ancestor-descendent and sibling relationships respectively. 𝑁
is the total number of nodes in AST. We denote the 𝑖th node in
In order to encode the tree-shaped AST, it first needs to be converted
the linearized AST as 𝑛𝑖 . 𝐴𝑖 𝑗 is the distance of the shortest path
into a sequence with a linearization method. There are the three
between 𝑛𝑖 and 𝑛 𝑗 in the AST. 𝑆𝑖 𝑗 is horizontal sibling distance
most representative linearization methods used in current works:
between 𝑛𝑖 and 𝑛 𝑗 in the AST if they satisfy the sibling relationship.
(1) Pre-order Traversal (POT): It visits the tree nodes with pre- If one relationship is not satisfied, its value in the matrix will be
order traversal. Sequences obtained by pre-order traversal infinity. Note that we consider the relative relationship between two
are lossy since the original ASTs cannot be unambiguously nodes, which means 𝐴𝑖 𝑗 = −𝐴 𝑗𝑖 and 𝑆𝑖 𝑗 = −𝑆 𝑗𝑖 if a relationship
reconstructed back from them. exists between 𝑛𝑖 and 𝑛 𝑗 .
(2) Structure-based Traversal (SBT): It adds additional brack- Formally, we use SPD(𝑖, 𝑗) and SID(𝑖, 𝑗) to denote the Shorted
ets [18] to indicate the parental-descendent relationship such Path Distance and horizontal SIbling Distance between 𝑛𝑖 and 𝑛 𝑗
that each sequence can be unambiguously mapped back to in the AST. The values in the relationship matrices are defined as:
the AST, but it also doubles the size of the linearized se-
quence. 
(3) Path Decomposition (PD): It represents the AST by concate- SPD(𝑖, 𝑗) if |SPD(𝑖, 𝑗)| ≤ 𝑃
𝐴𝑖 𝑗 =
nating the path between two random leaf nodes. The total ∞ otherwise
 (1)
number of paths can be too large for computing and there- SID(𝑖, 𝑗) if |SID(𝑖, 𝑗)| ≤ 𝑃
𝑆𝑖 𝑗 =
fore random sampling is needed [4]. ∞ otherwise
Table 1 shows the AST in Figure 1 linearized with the above
three different methods. For POT and SBT, the linearized trees 𝑃 is a pre-defined threshold and nodes with relative distance
can be directly fed into the encoder. For PD, the average total beyond 𝑃 will be ignored. We hypothesize that precise relative dis-
number of paths can be over 200, concatenating them all to train tance is not useful beyond a certain range. It can both constrain the
is infeasible [4]. In practice, mean pooling is run over the states computation complexity within a constant range and save memory
of each path such that each path has one unique representation. space for storing the relative position embeddings. Figure 2 shows
The decoder only attends to these unique representations of paths an example of generating matrix 𝐴 and 𝑆, in comparison with the
instead of specific nodes within paths. This can affect the model position matrix generated from a linear relationship, which is used
when copying user-defined names (in leaf nodes) is needed. in standard Transformers. In the next section, we will introduce
We adopt the simplest POT linearization for our model. We how to use these two matrices to dynamically incorporate such
show that it has already achieved SOTA results and more complex relationship information through a tree-structured attention.
linearization methods like SBT do not help. PD does not apply to our
model since it treats one path as a whole. We will show in section 6.3 3.3 Tree-Structured Attention
that this leads to poor performance in code summarization. Tree-structured attention is built on the standard self-attention
with relative position embeddings and disentangled attention. It
3.2 Relationship Matrices replaces the relative position embeddings derived from the linear
We define two kinds of relationships between nodes in the tree that relationship into the two matrices derived from the tree structure.
we care about: ancestor-descendant (𝐴) and sibling (𝑆) relationships. Self-Attention. Standard self-attention transforms the input
The former represents the hierarchical information across blocks, sequence x = (𝑥 1, . . . , 𝑥𝑛 ) (𝑥𝑖 ∈ R𝑑 which stands for the embedding
and the latter represents the temporal information within one block. of 𝑛𝑖 ) into a sequence of output vectors o = (𝑜 1, . . . , 𝑜𝑛 ) (𝑜𝑖 ∈ R𝑑 ).

152

Authorized licensed use limited to: China University of Petroleum. Downloaded on April 11,2023 at 12:23:20 UTC from IEEE Xplore. Restrictions apply.
ICSE ’22, May 21–29, 2022, Pittsburgh, PA, USA Ze Tang, Xiaoyu Shen, Chuanyi Li, Jidong Ge, Liguo Huang, and Zhelin Zhu, Bin Luo

The single-head self-attention [49] can be formulated as: that it will not add any additional parameter on top of the standard
𝑸 (𝑥𝑖 )𝑲 (𝑥 𝑗 )  Transformer. ℎ𝐴 heads will use 𝛿𝐴 (𝑖, 𝑗) and the rest ℎ𝑆 heads will
𝜶 𝒊𝒋 = √ use 𝛿𝑆 (𝑖, 𝑗). Information from the two relationships will be merged
𝑑 together through multi-head attention. We then replace 𝛿 (𝑖, 𝑗) in
𝑛
 (2)
Eq 4 with 𝛿𝑅 (𝑖, 𝑗) in Formula 5, and apply a scaling factor of √1 on
𝑜𝑖 = 𝜎 (𝜶 𝒊𝒋 )𝑽 (𝑥 𝑗 ) 3𝑑
𝑗=1 𝛼˜𝑖,𝑗 (because it has 3 items). The final output vector is computed as
in Eq (6), where 𝑽 𝑷 represents the value project matrix of relative
where 𝑸, 𝑲 : R𝑑 → R𝑚 are query and key functions respectively, distances and 𝑽𝑹𝑷 is the 𝑅𝑖 𝑗 -th row of 𝑽 𝑷 .
𝑽 : R𝑑 → R𝑑 is a value function, 𝜎 is a scoring function (e.g. 𝒊𝒋

softmax or hardmax). 𝑗 ∈ { 𝑗 |𝛿


𝑅 (𝑖,𝑗) >0}
Relative position embedding. Eq 2 is a content-only attention 𝛼˜𝑖,𝑗
𝑜˜𝑖 = 𝜎 ( √ )(𝑽 (𝑥 𝑗 ) + 𝑽𝑹𝑷𝒊𝒋 ) (6)
without any position information. The initial Transformer model 3𝑑
𝑗
uses absolute position embeddings to inform about the position.
Shaw et al. [36] proposed replacing them with relative position Note that we only compute the attention weights for node pairs
embeddings, which has shown more effective in code summariza- where 𝛿𝑅 (𝑖, 𝑗) > 0), which is similar to the idea of sliding win-
tion tasks [1]. The relative position 𝛿 (𝑖, 𝑗) reflects the pairwise dow [7] and can reduce the time and space complexity of the self-
distance between 𝑛𝑖 and 𝑛 𝑗 . Denote 𝑃 as the max relative distance, attention process. We will discuss its implementation and analyze
𝛿 (𝑖, 𝑗) ∈ [0, 2𝑃] can be defined as: its complexity in sections 4 and 5 respectively.

⎪ 0 for 𝑖 − 𝑗 ≤ −𝑃

⎨ 4 EFFICIENT IMPLEMENTATION
𝛿 (𝑖, 𝑗) = 2𝑃 for 𝑖−𝑗 ≥𝑃 (3)

⎪ 𝑖 − 𝑗 + 𝑃 others. A limitation of the full attention mechanism in standard Transform-
⎩ ers is the computational and memory cost that grows quadratically
In this way, we can map each relative distance into an embedding with the sequence length. AST-Trans we proposed can alleviate
representation. The relative position embeddings can be added on this problem since the attention scores only need to be computed
top of Eq 2 to inform the pairwise distance. for node pairs where 𝛿𝑅 (𝑖, 𝑗) > 0. Nevertheless, a memory and
Disentangled Attention. Disentangled Attention [16] uses rel- computational efficient implementation of AST-Trans that supports
ative position embedding as bias in self-attention process. Each parallel processing is non-trivial. The essence of AST-Trans is similar
word is represented using two vectors that encode its content and to previous works that apply sliding windows to constrain the at-
relative position in an disentangled way. The attention computa- tention within a fixed range [7, 54]. With sliding windows, the node
tion is then divided into three parts: content-to-content, content- pairs in the sequence data can be planned into a linear distribution
to-position and position-to-content, defined as: (by ignoring node pairs with 𝛿 (𝑖, 𝑗) = 0 or 2𝑃 − 1) and computed
in parallel with matrix partitioning. However, this technique does

𝛼˜𝑖,𝑗 = 𝑸 (𝑥𝑖 )𝑲 (𝑥 𝑗 )  + 𝑸 (𝑥𝑖 )𝑲𝛿𝑷(𝑖,𝑗) + 𝑸 𝜹𝑷 (𝒋,𝒊) 𝑲 (𝑥 𝑗 )  not apply to us since the position distribution of relevant nodes
      (4) changes with every tree structure, which makes matrix blocking
content-to-content content-to-position position-to-content infeasible. In this section, we present the following 5 alternative
implementations of AST-Trans and discuss the pros and cons:
where 𝑸 𝑷 , 𝑲 𝑷 ∈ R (2𝑃 +1)×𝑚 represent the query and key projec- Mask. Mask out the attention scores where 𝛿𝑅 (𝑖, 𝑗) = 0 after
tion matrices of relative positions. 𝑲𝛿𝑷(𝑖,𝑗) is the 𝛿 (𝑖, 𝑗)-th row of computing the full attention among all nodes. It has the same qua-
𝑲 𝑷 and 𝑸𝛿𝑷(𝑖,𝑗) is the 𝛿 (𝑖, 𝑗)-th row of 𝑸 𝑷 respectively. The last two dratic time and space complexity as in the standard Transformer.
items, i.e., content-to-position and position-to-content, are used to Loop. Loop over node pairs where 𝛿𝑅 (𝑖, 𝑗) > 0 and compute the
measure the relative positions between a word pair. attention scores. It is memory and computational efficient but does
Besides, for content-to-position computation, as all possible rel- not support parallel processing.
ative positions are always in [0, 2𝑃], the scores of query content Sparse. We can store 𝛿𝑅 as a sparse tensor 𝑆𝑇 (𝛿𝑅 ) and deep learn-

𝑸 (𝑥) to all key positions 𝑲 𝑷 can be first computed as 𝑸 (𝑥)𝑲 𝑷 , ing frameworks, such as Pytorch, can automatically skip operations
and then gathered into 𝛼˜ with 𝛿 (𝑖, 𝑗) as index. In this way, The with zero elements when multiplying a sparse tensor with a normal
relative position embedding can be reused for all query contents tensor. The mask operation can be optimized (for example, content-
and thus reduce the space complexity to 𝑂 (2𝑃𝑚) . to-position attention scores in Eq 4 can be computed by gathering

Attention with Tree-Structured Relationships. Our method 𝑄 (𝑥)𝐾 𝑃 with 𝑆𝑇 (𝛿𝑅 )). However, it can only apply to content-to-
essentially replaces 𝛿 (𝑖, 𝑗), the relative distance defined under the position and position-to-content. For content-to-content, we still
linear relationship, with 𝛿𝑅 (𝑖, 𝑗) where 𝑅 stands for either the have to use the Mask or Loop strategy since the production of two
ancestor-descendent relationship 𝐴 or the sibling relationship 𝑆 in sparse tensors is not directly supported.
the tree structure. 𝛿𝑅 (𝑖, 𝑗) is defined as: Gather with COO (GC). On the basis of Sparse, the content-
 to-content computation can be optimized by additional gather op-
𝑅𝑖 𝑗 + 𝑃 + 1 if 𝑅𝑖 𝑗 ∈ [−𝑃, 𝑃]
𝛿𝑅 (𝑖, 𝑗) = (5) erations. The core idea of GC is to put query-key pairs that need to
0 if 𝑅𝑖 𝑗 = ∞
be computed into one-to-one correspondence, and store them as
𝑅𝑖 𝑗 refers to either 𝐴𝑖 𝑗 or 𝑆𝑖 𝑗 defined in Eq 1. As there are two kinds dense matrices. Coordinate format (COO) is a common way to store
of relationships, we consider only one relationship in each head so sparse tensors, where only non-zero elements are stored as tuples of

153

Authorized licensed use limited to: China University of Petroleum. Downloaded on April 11,2023 at 12:23:20 UTC from IEEE Xplore. Restrictions apply.
AST-Trans: Code Summarization with Efficient Tree-Structured Attention ICSE ’22, May 21–29, 2022, Pittsburgh, PA, USA

     There are 3 benefits of this approach compared with GC:
• 𝐾 𝑃 and 𝑄 𝑃 can be reused, as each 𝑄𝑟𝑜𝑤𝑠 and 𝐾𝑟𝑜𝑤𝑠 have the
   
same relative distance 𝑠. The position embeddings of 𝑠 can be
        directly added into the content without gather operations.
• Only a quarter of number of gather operation is needed
   
(discussed in 5.3).
• Only one dot production is required, as the second 𝑄𝑠𝑃  𝐾𝑠𝑃
   can be reused and only (𝑄𝑟𝑜𝑤𝑠 + 𝑄𝑠𝑃 )  (𝐾𝑟𝑜𝑤𝑠 + 𝐾𝑠𝑃 ) needs
to be calculated.
       See Appendix A for the complete algorithm.

5 COMPLEXITY ANALYSIS
Figure 3: Decompose the relative distance matrix 𝛿𝑅 of the tree
“abcd" with max relative distance 𝑃 = 1. In this section, we will discuss the best, worst and average complex-
ity of 5 implementations mentioned above. We use FLOPs (floating
point operations) to measure the computational complexity. The
element indices and the corresponding values. Let 𝐶𝑂𝑂𝑟𝑜𝑤 /𝐶𝑂𝑂𝑐𝑜𝑙 considered operations includes: matrix multiplication, matrix dot
denotes the list of row/column indexes, and 𝐶𝑂𝑂 𝑣𝑎𝑙 denotes the production, add and gather operation which are the main operations
list of values in the COO format of 𝛿𝑅 . We then use them as indexes involved for the attention computation. FLOPs of these operations
to gather the query and key of content as: are listed below:
𝑄𝑟𝑜𝑤 = 𝑄 (𝑥) [𝐶𝑂𝑂𝑟𝑜𝑤 ; :]; 𝐾𝑐𝑜𝑙 = 𝐾 (𝑥) [𝐶𝑂𝑂𝑐𝑜𝑙 ; :] 𝐹 𝐿𝑂𝑃𝑠 (𝐴 + 𝐵) = 𝑁 (𝑚 − 1); 𝐹 𝐿𝑂𝑃𝑆 (𝐴[𝐶; :]) = |𝐶 | ∗ 𝑚
𝑃
𝑄 𝑣𝑎𝑙 = 𝑄 𝑃 [𝐶𝑂𝑂 𝑣𝑎𝑙 ; :]; 𝐾𝑣𝑎𝑙
𝑃
= 𝐾 𝑃 [𝐶𝑂𝑂 𝑣𝑎𝑙 ; :] 𝐹 𝐿𝑂𝑃𝑠 (𝐴  𝐵) = 𝑁𝑚 2 + 𝑁 (𝑚 − 1) (7)

By this way, each column in the query content 𝑄𝑟𝑜𝑤 corresponds to 𝐹 𝐿𝑂𝑃𝑠 (𝐴 × 𝐵 ) = 𝑁 ∗ 𝐹 𝐿𝑂𝑃𝑠 (𝐴  𝐵)
the same column in the key content 𝐾𝑐𝑜𝑙 . Then we can use matrix where 𝐴 and 𝐵 are two matrices with shape [𝑁 , 𝑚], 𝐴[𝐶; :] indicates
dot production to compute attention scores: gather 𝐴 with 𝐶 as the index, |𝐶 | is the number of elements in 𝐶.
𝛼𝑐𝑜𝑜 = 𝑄𝑟𝑜𝑤  𝐾𝑐𝑜𝑙 + 𝑄𝑟𝑜𝑤  𝐾𝑣𝑎𝑙 𝑃 𝑃
+ 𝑄 𝑣𝑎𝑙  𝐾𝑐𝑜𝑙 We will focus our analysis on attention heads using the ancestor-
descendent relationship (𝐴), but similar ideas can be used to analyze
where  indicates dot production. 𝛼𝑐𝑜𝑜 is a vector and corresponds the sibling relationship (𝑆) straightforwardly. As the complexity is
to the non-zero values in 𝛼˜ (Eq. 4), and 𝛼˜ [𝐶𝑂𝑂𝑟𝑜𝑤 [𝑖]; 𝐶𝑂𝑂𝑐𝑜𝑙 [𝑖]] = related to the number of non-zero elements in 𝛿𝐴 (denoted with
𝛼𝑐𝑜𝑜 [𝑖]. The content-to-position or position-to-content can be com- |𝛿𝐴 > 0|). We first analyze the range of |𝛿𝐴 > 0|, then present the
puted the same as in Sparse, and the total number of gather opera- complexity of each implementation.
tions in attention computation is 4 times of non-zero elements in
𝛿𝑅 : 2 for gathering the content and 2 for gathering the position. 5.1 Range of |𝛿𝐴 > 0|
Gather with decomposed COO (GDC). To reduce the number Theorem 5.1. For any directed tree 𝑇 , let E(i) represent the number
of gather operations in GC, we can add a matrix decomposition of paths in 𝑇 with length 𝑖, 𝐿 represent the length of the longest path
operation on top of it. First, we decompose 𝛿𝑅 by 𝐶𝑂𝑂 𝑣𝑎𝑙 such that in 𝐺, we have:
each sub-matrix 𝛿𝑅𝑠 contains only node-pairs with the same relative
𝐸 (1) > 𝐸 (2) > · · · > 𝐸 (𝐿)
distance 𝑠. An example is shown in Figure 3, where the original 𝛿𝑅
contains 3 distinct values and we decompose it into 3 sub-matrices Proof. Assuming there are 𝑁 nodes in the tree, and the root
accordingly. We transfer each sub-matrix 𝛿𝑅𝑠 into its COO format node is at level 1. Define 𝑁 𝑗 as the number of nodes at level 𝑗. For
and use 𝐶𝑂𝑂 𝑠 to indicates the sub-matrix with 𝑣𝑎𝑙 = 𝑠. For each each node at level 𝑗, if 𝑗 − 𝑖 > 0, there exists one path of length
sub-matrix 𝐶𝑂𝑂 𝑠 , we gather content embeddings of nodes by: 𝑖 ending with this node, otherwise no such path exists. Hence,
𝑠
𝑄𝑟𝑜𝑤𝑠 = 𝑄 (𝑥) [𝐶𝑂𝑂𝑟𝑜𝑤 𝑠
; :], 𝐾𝑐𝑜𝑙𝑠 = 𝐾 (𝑥) [𝐶𝑂𝑂𝑐𝑜𝑙 ; :] 𝐸 (𝑖) = 𝑁 − 𝑖𝑗=1 𝑁 𝑗 and 𝑁 𝑗 > 0. Therefore we must have 𝐸 (𝑖) >
𝐸 (𝑖 + 1). 
where 𝑄𝑟𝑜𝑤𝑠 indicates the query content ordered by 𝐶𝑂𝑂𝑟𝑜𝑤𝑠 , and
𝑠
𝐾𝑐𝑜𝑙𝑠 represents the key content ordered by 𝐶𝑂𝑂𝑐𝑜𝑙 . The attention Theorem 5.2. Every tree with 𝑁 nodes has exactly 𝑁 − 1 edges.
scores can then be computed as:
Proof. Imagine starting with 𝑁 isolated nodes and adding edges
𝛼𝑐𝑜𝑜𝑠 = (𝑄𝑟𝑜𝑤𝑠 + 𝑄𝑠𝑃 )  (𝐾𝑟𝑜𝑤𝑠 + 𝐾𝑠𝑃 ) − (𝑄𝑠𝑃  𝐾𝑠𝑃 ) one at a time. By adding one edge, we will either (1) connect two
components together, or (2) close a circuit. Since a tree is fully
where 𝛼𝑐𝑜𝑜𝑠 corresponds to the attention scores of node pairs in
connected and has no circuit, we must add exactly 𝑁 − 1 edges. 
𝛿𝑅𝑠 . Note that 𝛼𝑐𝑜𝑜𝑠 is a vector of the same shape as 𝐶𝑂𝑂𝑟𝑜𝑤𝑠 . By
𝑠
padding all 𝐶𝑂𝑂 to the same length, the attention scores can be Least upper & Greatest lower bound. Let 𝐸 (0) = 𝑁 denote
computed in parallel and the final attention scores equal to the sum the number of nodes in a tree. We have |𝛿𝐴 > 0| = 𝐸 (0) + 2(𝐸 (1) +
of all 𝛼𝑐𝑜𝑜𝑠 : 𝐸 (2) + . . . 𝐸 (𝑃)) since we consider both positive and negative dis-

2𝑃+1 tance in 𝛿𝐴 . Based on the above two theorems, we can have:
𝛼𝑐𝑜𝑜 = 𝛼𝑐𝑜𝑜𝑠
𝑠=1 𝐸 (𝑖) ≤ 𝐸 (𝑖 − 1) − 1 ≤ . . . 𝐸 (0) − 𝑖 = 𝑁 − 𝑖

154

Authorized licensed use limited to: China University of Petroleum. Downloaded on April 11,2023 at 12:23:20 UTC from IEEE Xplore. Restrictions apply.
ICSE ’22, May 21–29, 2022, Pittsburgh, PA, USA Ze Tang, Xiaoyu Shen, Chuanyi Li, Jidong Ge, Liguo Huang, and Zhelin Zhu, Bin Luo

   

   


  

Figure 4: |𝛿𝐴 > 0| in case of random trees, the abscissa is the max Figure 5: Theoretical complexity with 𝑃 = 5, 𝑚 = 32. loop has the
relative distance 𝑃 and the ordinate is the non-zero elements in 𝛿𝐴 lowest complexity but cannot be parallelized in practice.
with the unit of 𝑂 (𝑁 ). The coefficient decreases with growing 𝑃.

Table 2: Statistics of Java and Python Datasets


|𝛿𝐴 > 0| ≤ 𝑁 + 2(𝑁 − 1 + 𝑁 − 2 + . . . 𝑁 − 𝑃) = (𝑁 − 𝑃)(2𝑃 + 1)
It is the least upper bound for the ancestor-descendent relationship Perspectives Java Python
and is achieved only when each node has strictly one child node. # of Train instances 69,708 55,538
The greatest lower bound can be achieved when the tree’s depth is # of Validation instances 8,714 18,505
2. In this situation, 𝐸 (𝑖) = 0 for 𝑖 ≥ 2 and |𝛿𝐴 > 0| = 3𝑁 − 2. # of Test instances 8,714 18,502
Average. We can use the Prüfer sequence [35] to simulate ran- Avg. # of tokens in code 120 48
dom trees so we can estimate the average of |𝛿𝐴 > 0| with different Avg. # of nodes in AST 158 100
tree structures. The tree size 𝑁 is set in the range of [50, 500] and Avg. # of tokens in SBT 632 402
the out-degree of each node is randomly selected from 1 to 𝑁 − 1 Avg. # of tokens in summary 18 9
(controlled by the max value in Prüfer sequence). We did 1,000
simulation experiments and Figure 4 shows the result.
The average |𝛿𝐴 > 0| when 𝑃 is sampled from a uniform distri- (as 𝐶𝑂𝑂𝑟𝑜𝑤𝑠 is the same index of query), and when 𝑠 < 𝑃 +1, the key
bution in [1, 50] is 1.16𝑃𝑁 . We can see that the coefficient in Figure 4 content does not need to be gathered. Hence, we only need (2𝑃 +1)𝑁
gradually decreases. For larger 𝑃, the average |𝛿𝐴 > 0| will be much gather operations from content. Secondly, padding positions do not
smaller than the upper bound of (2𝑃 + 1)(𝑁 − 𝑃). need to be computed in dot production as the padding positions
of both 𝑄𝑟𝑜𝑤𝑠 and 𝐾𝑟𝑜𝑤𝑠 are the same. After adding the position
5.2 Mask & Loop & Sparse & GC bias, all 𝑄𝑟𝑜𝑤𝑠 and 𝐾𝑟𝑜𝑤𝑠 can be packed before dot production, then
Mask contains 1 matrix multiplication with [𝑁 , 𝑚] × [𝑚, 𝑁 ] in unpacked to their original length afterwards. By this way, we only
content-to-content, 2 matrix multiplication with [𝑁 , 𝑚] × [𝑚, 2𝑃 +1] need to compute related node pairs with one dot production.
and 2 gather operations with index shape [𝑁 , 𝑁 ] for content-to- In consequence, the complexity of GDC includes (2𝑃 + 1)𝑁𝑚
position and position-to-content, and 2 add operations are used gather operations, 1 dot production with shape [|𝛿𝐴 > 0|, 𝑚] and
for final score computation. The complexity is (𝑁 2 + (2𝑃 + 1)𝑁 ) ∗ 3 add operations with shape [|𝛿𝐴 > 0|], which equals to |𝛿𝐴 >
(𝑚 2 + 𝑚 − 1) + 2𝑁 2 + 𝑁 − 1. 0|(𝑚 2 + 𝑚 − 1) + (6𝑃 + 3)𝑁𝑚 + (2𝑃 + 1)𝑁 .
Loop As loop only computes non-zero elements in 𝛿𝐴 , the com- For better comparison, we also show the theoretical complexity
plexity includes 1 dot production of |𝛿𝐴 > 0|(𝑚 2 +𝑚 − 1) and 2 add in Figure 5 under the hyper-parameters in our experiments. As can
operations |𝛿𝐴 > 0| ∗ 2(𝑚 − 1), and equals to |𝛿𝐴 > 0|(𝑚 2 + 3𝑚 − 3). be seen, loop has the lowest complexity but cannot be parallelized.
Sparse’s complexity is same as Mask apart from the gather opera- mask and sparse grow quadratically with the AST size. GDC
tion with index shape |𝛿𝐴 > 0| (the time complexity for gathering slightly outperforms GC and has a complexity close to loop.
sparse tensor as index equals to the number of non-zero elements in
it), which equals to (𝑁 2 + (2𝑃 +1)𝑁 ) ∗ (𝑚 2 +𝑚 −1) +2|𝛿𝐴 > 0| +𝑁 −1. 6 EXPERIMENTS
GC The complexity in GC is all related to |𝛿𝐴 > 0|. It contains 4 In this section, we first explain the experimental setup, evaluation
gather operations, 3 dot production and 2 add operations, which metrics and baseline approaches, then report the main results and
leads to the complexity of |𝛿𝐴 > 0|(𝑚 2 + 3𝑚 + 4) + 2(2𝑃 + 1)𝑁𝑚. perform ablation studies. The runtime speed and memory cost of
different implementations are provided for comparison. Finally, we
5.3 GDC present a qualitative analysis and discuss the future directions.
There are two implementation details in GDC to optimize the time
and space complexity. Firstly, in a tree, if 𝑠 ≥ 𝑃 + 1, the decomposed 6.1 Experimental Setup
sub-matrix 𝐶𝑂𝑂 𝑠 has at most one non-zero value in each row. Datasets. Experiments are conducted on the two public code sum-
(for example, each non-root node has exactly one parent node in marization benchmarks, one in Java [19] and the other in Python [51].
Figure 3.) We can fix 𝐶𝑂𝑂𝑟𝑜𝑤 𝑠 to [0, 1, . . . , 𝑁 − 1] and only store To ensure the quality of comments, we filter the comments with
the corresponding 𝐶𝑂𝑂𝑐𝑜𝑙 𝑠 . When 𝑠 < 𝑃 + 1, as the relationship is
less than 4 words, constructors, setters, getters, and tester methods,
symmetric, 𝐶𝑂𝑂 𝑠 can be represented with 𝐶𝑂𝑂 2𝑃+2−𝑠 . Based on same as in Shido et al. [41]. When the comment has two or more
this, when 𝑠 ≥ 𝑃 + 1, the query content does not need to be gathered sentences, only the first sentence is kept as the description of the

155

Authorized licensed use limited to: China University of Petroleum. Downloaded on April 11,2023 at 12:23:20 UTC from IEEE Xplore. Restrictions apply.
AST-Trans: Code Summarization with Efficient Tree-Structured Attention ICSE ’22, May 21–29, 2022, Pittsburgh, PA, USA

Table 3: Comparison of AST-Trans with the baseline methods, categorized based on the input type. * means implemented by ourselves.

Java Python
Methods Input
BLEU (%) METEOR (%) ROUGE-L (%) BLEU (%) METEOR (%) ROUGE-L (%)
CODE-NN[20] 27.6 12.61 41.10 17.36 09.29 37.81
API+CODE[19] 41.31 23.73 52.25 15.36 08.57 33.65
Dual Model[53] Code 42.39 25.77 53.61 21.80 11.14 39.45
BaseTrans*[1] 44.58 29.12 53.63 25.77 16.33 38.95
Code-Transformer*[57] 45.74 29.65 54.96 30.93 18.42 43.67
Tree2Seq[11] 37.88 22.55 51.50 20.07 08.96 35.64
RL+Hybrid2Seq[51] 38.22 22.75 51.91 19.28 09.75 39.34
GCN*[22] AST(Tree) 43.94 28.92 55.45 32.31 19.54 39.67
GAT*[50] 44.63 29.19 55.84 32.16 19.30 39.12
Graph-Transformer*[40] 44.68 29.29 54.98 32.55 19.58 39.66
Code2Seq*[4] 24.42 15.35 33.95 17.54 08.49 20.93
AST(PD)
Code2Seq(Transformer)* 35.08 21.69 42.77 29.79 16.73 40.59
DeepCom[18] 39.75 23.06 52.67 20.78 09.98 37.35
Transformer(SBT)* AST(SBT) 43.37 28.36 52.37 31.33 19.02 44.09
AST-Trans(SBT)* 44.15 29.58 54.73 32.86 19.89 45.92
Transformer(POT)* 39.62 26.30 50.63 31.86 19.63 44.73
AST(POT)
AST-Trans 48.29 30.94 55.85 34.72 20.71 47.77

AdamW optimizer [28] with 𝑙𝑟 = 1𝑒−3, 𝛽 1 = 0.9, 𝛽 2 = 0.999,𝜃 = 1𝑒−


6, label smoothing with 𝜃𝑙𝑠 = 0.1 [46] and dropout probability [44]
of 0.2. The patience in the early stopping mechanism [32] is set to
20 and we select the model based on the BLEU in the validation set
4.
Evaluation Metrics. We evaluate the performance with corpus
BLEU [33], METEOR [6], and ROUGE-L [27].
The experiments used the GPUs provided by Aliyun, which use
EFLOPS [9] architecture and ACCL [10]. EFlops architecture im-
Figure 6: Distribution of relative distance 𝑝 in training sets
proves the scalability and efficiency of commodilty clusters (CoW),
and ACCL bring the performant efficiency of EFlops architecture
method. Table 2 shows the statistics of the datasets. We also count to general cluster systems and Cloud scenarios.
the distribution of relative distances in Fig 6. As can be seen, most
ancestor-descendent and sibling relationships are within the range 6.2 Baselines
of 5 and 10 respectively. We compare the proposed AST-Transformer with 16 baseline meth-
Pre-processing. First, we pre-process the summaries by removing ods. They can be divided into 5 groups based on the input type:
the punctuations. Next, we split multi-words, such as “gettable- 1: Code. Models with the code as input. It treats code as plain
types", in summaries with wordninja 3 since their corresponding text and does not leverage ASTs. Code-NN [20] used RNN while
tokens in the source code are split too [53]. We also split the leaf BaseTrans [1] used the Transformer. On the basis of Code-NN,
nodes in ASTs into sub-tokens if they are in form of the CamelCase Dual Model[53] used dual learning to train code summarization
or snake_case. The split nodes are treated as new children of the and generation together. API+CODE [19] used multi encoders
original parent node. Finally, we reverse the children of the root to encode code along with the API call sequence. To make up
node to prevent the important information, such as function names for the lack of structural information, Code-Transformer [57]
or parameters, from being cut when the size of input AST exceeds additionally adds four structure distances, including two kinds of
the maximum size allowed. distance mentioned in Sec 3.2, to the code tokens and does attention
Hyper-parameters. If not specified, the maximum size of AST computation separately for each kind of distance. Differently, it
is set to 200 for all experiments, and the vocabulary sizes of both does not distinguish embeddings of different relations and uses sine
ASTs and comments are set to 30, 000. We use 4 layers of stacked and cosine functions to represent distance embeddings.
encoder-decoder and set the hidden size 𝑑 = 256, 𝑚 = 32. For 2: AST(Tree). Models with the AST as input and encode it with
each attention layer, we set ℎ𝐴 = 1 and ℎ𝑆 = 7. The max relative tree-specific encoders. There are two main types of such encoders.
distance for ancestor-descendant/sibling relationship 𝑃𝐴 is set to One uses Tree-LSTM, such as Tree2Seq [11] and RL+Hybrid2Seq [51].
10/5 respectively. Feed-forward inner-layer dimension is 2048 and RL+Hybrid2Seq adds the code information and deep reinforce-
the activation function is gelu [17]. While training, the batch size is ment for training. The other treats the AST as graph and encodes
128 and the maximum epochs is 500. Models are trained using the
4 We also report the results with best METEOR and ROUGE-L in the validation set in
3 https://github.com/keredson/wordninja Appendix B

156

Authorized licensed use limited to: China University of Petroleum. Downloaded on April 11,2023 at 12:23:20 UTC from IEEE Xplore. Restrictions apply.
ICSE ’22, May 21–29, 2022, Pittsburgh, PA, USA Ze Tang, Xiaoyu Shen, Chuanyi Li, Jidong Ge, Liguo Huang, and Zhelin Zhu, Bin Luo

it with graph neural network (GNN) models. We consider three Table 4: Ablation study on AST-Trans with/without 𝐴 and 𝑆.
kinds of GNN models including GCN [22], GAT[50] and Graph-
Transformer [40]. The edges fed to GNN includes the ancestor- Model Dataset BLEU (%) METEOR (%) ROUGE (%)
descendant and sibling edges, distinguished by the edge attributes. AST-Trans w/o A 47.74 30.21 54.56
AST-Trans w/o S Java 48.07 30.62 55.29
3: AST(PD). Models with the AST linearized with path decom- AST-Trans 48.29 30.94 55.85
position as input. Path representation needs to be encoded from AST-Trans w/o A 34.35 20.15 46.62
the nodes, then the whole AST representation is encoded from AST-Trans w/o S Python 34.32 20.28 46.87
AST-Trans 34.72 20.71 47.77
the path representations. Code2Seq [4] is the first approach us-
ing PD, and it used two LSTM models to encode hierarchical net-
works. For fairness of comparison, we also design a new baseline
Code2Seq(Transformer) by replacing these two LSTM models Table 5: Ablation study on ℎ𝐴 and ℎ𝑆 on Java Dataset.
with the Transformer.
4: AST(SBT). Models with the AST linearized with Structure- ℎ𝐴 ℎ𝑆 BLEU (%) METEOR (%) ROUGE-L (%)
based Traversal as input. DeepCom [18] is the first work that uses 0 8 47.74 30.21 54.56
AST (SBT) as input, which encodes it with LSTM. We design a new 1 7 48.29 30.94 55.85
2 6 48.28 30.94 55.64
baseline Transformer (SBT) that encodes AST (SBT) with the
3 5 48.25 30.92 55.66
Transformer. AST-Trans(SBT) is our proposed model that inputs 4 4 48.23 30.96 55.68
SBT with relationship matrices. 5 3 48.11 30.93 55.46
5: AST(POT). Models with the AST linearized with pre-order- 6 2 48.1 30.74 55.22
traversal as input. Transformer (POT) is the standard Trans- 7 1 48.24 30.91 55.57
former architecture with AST (POT) as input and AST-Trans is 8 0 48.07 30.62 55.29
our proposed model with tree-structured attention.
All Transformer-based models are based on the relative position
embeddings with disentangled attention mentioned in Section 3.3
with the same number of parameters. The same hype-parameters are among these three linearization methods. Using the AST (PD) as
used through the way for a fully fair comparison. input leads to poor performance on both datasets. There are two main
reasons. On the one hand, AST(PD) method was first proposed for
6.3 Main Results method name completion. Method names are much shorter than the
The main result of AST-Trans and the baselines are presented in code summaries, and do not include many details. PD linearization
Table 3 5 . AST-Trans outperforms all the baselines on all the three extracts features from paths, which aggregates high-level charac-
metrics. Specifically, it outperforms the best baseline by 3.61, 2.17 ters but ignores the detailed information in the node. However, code
in BLEU, 1.65, 1.08 in METEOR and 0.87, 3.04 in ROUGE-L on the summarization requires more detailed information in the code such
Java and Python datasets respectively. as the type of the return value, which is stored in the leaf nodes. On
Code vs AST (Tree) vs AST (linearized). Apart from AST- the other hand, Code2Seq(Transformer) uses a hierarchical network
Trans, on both two datasets, using GNNs to encode AST (Tree) achieved and the amount of trained parameters is much larger. It is thereby
the best results. The reason is that the AST has both structural and harder to converge than Transformer(SBT) and Transformer(POT).
semantic information, and the other two input types both lose part Impact of relationship matrix 𝑅. We compared the perfor-
of the structural information. All three variants of GNNs achieve mance of three kinds of inputs with or without the relation matrix 𝑅:
similar results and outperform the Tree-LSTM in encoding the AST Code-Transformer vs BaseTrans, AST-Trans (SBT) vs Transformer
(Tree). Compared with taking the linearized AST as input, models (SBT) and AST-Trans (POT) vs Transformer(POT). Results show
only using the code perform better on the Java dataset but worse on that adding 𝑅 improves the performance for all these inputs and AST-
the Python dataset. This could be related to the code length. As code Trans (POT) performs the best. This is because Code-Transformer
and corresponding ASTs in Python are relatively shorter, encoding ignores non-leaf node information, and AST-Trans (SBT) stores
ASTs is more effective than in the Java dataset. Therefore, mod- duplicate information, resulting in too long sequence length. AST-
els using linearized ASTs, with the help of additional structural Trans (POT) maintains a short sequence length without losing
information, are able to outperform models using only the code. necessary structural or semantic information.
AST(PD) vs AST(SBT) vs AST(POT). Among three lineariza- AST-Trans vs GNN. AST-Trans outperforms GNNs, the best-
tion methods, when using the Transformer encoder/decoders, AST performed baseline model in both datasets. With the help of rela-
(SBT) performs the best on the Java dataset and AST (POT) performs tionship matrix, AST-Trans includes additional relative distance
the best on the Python dataset. AST(SBT) and AST(POT) both have information. Nodes can perceive information from its 𝑝-distance
their own advantages. AST(SBT) maintains more structural infor- neighbors at each layer. For GNN, however, each node needs 𝑝
mation than AST(POT) while AST(POT) has the shortest length hops to propagate information from these neighbors. In addition,
AST-Trans uses multi-head mechanism to compute different rela-
5 The results of BaseTrans [1] in the Python dataset are lower than reported in the paper tionships in different heads, while all relationships, distinguished by
(-6.75 BLEU, -3.44 METEOR and -7.78 ROUGE), then we set max relative distance 𝑃 to edge attribute, are calculated together in GNNs. AST-Trans also uses
16 (kept the same as original paper) and get 27.27(-5.25) BLEU, 15.90(-3.87) METEOR,
38.58(-8.15) ROUGE-L. This reduction may be because that we additionally segment extra feed-forward layers and residual connections in the encoder,
multi-words in comments. which could help improve the model generalization.

157

Authorized licensed use limited to: China University of Petroleum. Downloaded on April 11,2023 at 12:23:20 UTC from IEEE Xplore. Restrictions apply.
AST-Trans: Code Summarization with Efficient Tree-Structured Attention ICSE ’22, May 21–29, 2022, Pittsburgh, PA, USA

Table 6: Ablation study on 𝑃𝐴 and 𝑃𝑆 on Java Dataset.    

𝑃𝐴 𝑃𝑆 BLEU (%) METEOR (%) ROUGE-L (%)


0 0 36.34 23.83 45.58
1 1 46.95 30.33 54.24



 
5 1 47.45 30.11 54.28
5 3 47.82 30.29 54.62
5 5 48.14 30.77 55.45
10 5 48.29 30.94 55.85

Table 7: Ablation study on the number of layers on Java Dataset.


   
𝑛𝑢𝑚 BLEU (%) METEOR (%) ROUGE-L (%)
1 46.11 29.36 53.07 Figure 7: Runtime and memory cost of five implementations with
batch size=16. The cost of the mask implementation is equal to the
2 47.68 30.53 54.97
standard Transformer, which grows quadratically with the AST size.
3 47.41 30.04 54.07
4 48.29 30.94 55.85
5 47.8 30.39 54.61
6 48.31 30.58 55.09 Max relative distance We analyze the impact of the max rela-
tive distance 𝑃 in Table 6 . According to Table 6, the out-degree and
depth of most nodes in AST is in [0, 5] and [0, 10]. Therefore, the
6.4 Ablation studies max relative distance of ancestor-descendant (𝑃𝐴 ) and sibling rela-
We conducted ablation studies on four hyper-parameters: use of tion (𝑃𝑆 ) are selected from [1, 5, 10] and [1, 3, 5] respectively. Results
each relationship, number of heads used for ancestor-descendant show that as the relative distance grows, the performance improves
(ℎ𝐴 ) and sibling relationships (ℎ𝑆 ), max relative distance 𝑃 and the too, suggesting a wider view of nodes in AST relationships is help-
number of layers. In every study, apart from the hype-parameter ful. However, the improvement is marginal and even with 𝑃 = 1,
that needs to be analyzed, we keep the rest settings unchanged. the model performance can already outperform all other baselines.
Use of two relationships. We verified the impact of using This might be ascribed to the multi-layer stacked encoders. Even
ancestor-descendant or sibling relationship separately in Table 4. for 𝑃 = 1, longer-distance nodes can still be attended to indirectly
Results show that the performance is achieved when using them on upper layers. In practice, 𝑃 can be set as a hyperparameter to
all. However, using one of the relationships alone can already achieve balance the performance-efficiency trade-off.
close results and outperform all previous baselines. Number of Layers Finally, we perform ablation study by vary-
Number of attention heads. We change the number of heads ing the number of layers, and the results are presented in Table 7.
used for the ancestor-descendant relationship ℎ𝐴 from 0 to 8 and fix In our experiments, we observe that a deeper model (more layers)
the total number of heads to 8. As can be seem from Table 5, the best performs better, but the improvement saturates after 4 layers.
performance is obtained with ℎ𝐴 = 1 and ℎ𝑆 = 7, but there is no
significant difference among all combinations of ℎ𝐴 and ℎ𝑆 . Even 6.5 Complexity analysis
when one relationship is missing (ℎ𝐴 = 0 or ℎ𝑆 = 0), the effects In Fig 7, We analyzed the rum time and memory usage of different
are still marginal. However, when both relationships are removed implementations mentioned in section 4. Different from the theoret-
ℎ𝐴 = ℎ𝑆 = 0, the performance drops a lot. We conjecture that this ical complexity which analyze the attention computation in isolate,
phenomenon is related to the characteristics of AST. Knowing about operations in GPU can be computed in parallel, and there are other
one relationship can help the model “guess" the other relationship factors, e.g. decoder parameters, dependent libraries, vocabulary
properly. For example, the node “Compare" can be the child node of embeddings that all need memory usage. Therefore, the need for
“WhileExp”, “IFExp” or “SwitchExp”, etc, but when it is the sibling computing attention scores is only one part of it and leads to the gap
of node “Case”, it can only be the child of node “SwitchExp”. The between Fig 7 and 5, where the difference across implementations
information about its parent can be “guessed" in attention compu- in Fig 7 is much larger. Nevertheless, the trend stays the same. Time
tation with its sibling “Case”. Similarly, node “NameStore” can only and memory usage of GDC and GC both scale linearly with the
appear on the left side of a statement, and nodes with the same AST size, while the cost of Mask and Sparse grows quadratically.
parent as it must be its right siblings. Messages of these siblings can Even with the batched parallelism in GPUs, the implementation
be passed to “NameStore” through their common parent. However, of mask and sparse are still slower than GDC and GC while re-
there are many cases that the “guess" will not be successful. For quiring significantly more memory cost. GDC is faster and with
example, statements 𝑎 > 𝑏 and 𝑏 > 𝑎 have the same child nodes less memory usage than GC. The main reason is that GDC uses
and can only be distinguished by sibling relationship, while state- one quarter of gather operations compared with GC. Loop shows
ments 𝑎 = 𝑏 + 𝑎; 𝑏 = 𝑏 − 𝑎 and 𝑏 = 𝑏 − 𝑎; 𝑎 = 𝑏 + 𝑎 only differ in a linear growth in memory usage with AST size, but its time cost
ancestor-descendant relationship. It could be that the testset does is much higher as it does not support parallel operations. When
not have enough hard examples that need this fine-grained distinction the AST size grows further, we can expect the difference across
or the current metrics are not enough to reflect the difference. implementations will become larger and larger.

158

Authorized licensed use limited to: China University of Petroleum. Downloaded on April 11,2023 at 12:23:20 UTC from IEEE Xplore. Restrictions apply.
ICSE ’22, May 21–29, 2022, Pittsburgh, PA, USA Ze Tang, Xiaoyu Shen, Chuanyi Li, Jidong Ge, Liguo Huang, and Zhelin Zhu, Bin Luo

  Table 8: Qualitative examples.


 

public QuickActionView addActions(Collection <Action> actions){

checkShown();
 
mActions.addAll(actions);
 return this;
 }
  AST-Trans w/o S: adds a sub - action to the menu
 AST-Trans w/o A: adds the given actions to the list of actions
 AST-Trans: adds a collection of actions to the quick action view
Human Written: adds a collection of actions to the quick action view

public java.lang.Object newInstance() {


Object o = newInstanceImpl();
if(o == null){
throw new InstantiationException();
Figure 8: Heatmaps of relative position representations. x-axis is }
the relative position representation and the y-axis is the relative return o;
positions. The variance for the sibling relation (𝑆) is much larger }
than that for the ancestor-descendent relation (𝐴). AST-Trans w/o S: creates a new object initialized to the string object
AST-Trans w/o A: returns a new instance of the object class
AST-Trans: returns a new instance of the object
Human Written: creates a new instance of a class

def job_delete_by_tag(tag):
    Job.objects.get(tag=tag).delete()
    return (job_get_by_tag(tag) is None)

AST-Trans w/o S: delete a job and return tag
AST-Trans w/o A: delete a job objects
AST-Trans: delete a job based on its tag
Human Written: deletes a job entry based on its tag

6.6 Visualization and Qualitative Analysis

Visualization. We further visualize the relative position representations of the ancestor-descendant (A) and sibling (S) relationships in Fig. 8. As can be seen, the variance of the relative position embeddings in S is much larger than in A. This implies that our model is not sensitive to the relative distance between ancestor and descendant nodes, as the embeddings are almost the same regardless of the positions. In contrast, the variance for sibling nodes is relatively large, and the model can distinguish sibling nodes with different relative distances. In addition, the relative embeddings in A are demarcated between the upper and lower parts, suggesting a clear distinction between ancestor and descendant nodes. This shows that our model pays more attention to direction than to distance in A. It is likely that the exact distance between sibling nodes is more important than that between ancestor and descendant nodes in ASTs.

Qualitative analysis. We provide a couple of examples for qualitative analysis in Table 8. It can be observed that AST-Trans generates the summary closest to the reference, and that removing A or S hurts the quality of the summaries. In the first case, the key information is the connection between the sibling nodes, the method call ("addAll") and the parameter ("actions"). Both AST-Trans and AST-Trans w/o A describe it as a batch add operation, while AST-Trans w/o S misunderstands it as "adds an action". In contrast, the meaning of the third case is to get the job by its tag first and then delete it. The order of execution is controlled by the ancestor-descendant relationship (the method call "get" is the child node of "delete"), and AST-Trans w/o A simply ignores the "get" operation. The summaries of AST-Trans w/o A and w/o S are both correct in the second case: its statements are relatively simple, and ignoring their order does not affect comprehension of the function.
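As a concrete illustration of the two relationships discussed above, the sketch below derives bounded ancestor-descendant and sibling distances from an AST given as a parent-pointer array. It is only illustrative: the function name, the input format and the cut-off P are assumptions made here, not the implementation used in the paper.

from collections import defaultdict

def relative_distances(parent, P):
    # parent[i] is the parent index of node i (-1 for the root); node indices are
    # assumed to follow a pre-order traversal so that siblings appear left to right.
    # Returns dicts mapping (i, j) -> signed distance for the ancestor-descendant
    # relation A and the sibling relation S, keeping only pairs within distance P.
    n = len(parent)
    A, S = {}, {}

    # Ancestor-descendant: walk up the parent chain from every node.
    for i in range(n):
        j, dist = parent[i], 1
        while j != -1 and dist <= P:
            A[(j, i)] = dist       # descendant i sits `dist` levels below ancestor j
            A[(i, j)] = -dist      # signed, so direction is preserved
            j, dist = parent[j], dist + 1

    # Sibling: nodes sharing the same parent, ordered left to right.
    children = defaultdict(list)
    for i in range(n):
        children[parent[i]].append(i)
    for p, sib in children.items():
        if p == -1:                # the root has no siblings
            continue
        for a in range(len(sib)):
            for b in range(a + 1, min(a + P + 1, len(sib))):
                S[(sib[a], sib[b])] = b - a
                S[(sib[b], sib[a])] = a - b
    return A, S

Pairs that fall outside the cut-off carry no relationship at all, which is exactly what allows the tree-structured attention to exclude irrelevant nodes.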
7 THREATS TO VALIDITY

There are three main threats to the validity of our evaluation. Firstly, many public datasets have been proposed for code summarization. We select two widely used ones to evaluate the proposed AST-Trans, but they may not be representative of other programming languages. Secondly, to ensure as fair a comparison as possible, we build the baselines on top of the same Transformer architecture; the architecture and hyperparameter choices might therefore be sub-optimal for certain approaches, although AST-Trans still performs best among all results reported on both datasets. Finally, there is a certain gap between automatic and manual evaluation of the summarization results. We use three different automatic evaluation metrics to reduce this bias as much as possible.


8 RELATED WORKS

Code Summarization. Most approaches to code summarization frame the problem as a sequence generation task and use an encoder-decoder architecture. The main difference from traditional machine translation is that programming languages are unambiguous and follow rigid grammar rules. Most approaches either treat the source code as natural language (i.e., a sequence of tokens without specified structure), or utilize its structural information with the help of ASTs or other parsed forms. To encode the code sequence, there exist many encoder architectures such as CNN [3], RNN [20, 55] and the Transformer [1]. To leverage the tree-structured AST, tree-based models such as Recursive NN [26], Tree-LSTM [41, 51] and Tree-Transformer [15, 52] are used to encode the AST directly. As a tree is a special kind of graph, graph-based approaches [2, 12, 23] can also be used to encode ASTs. Some works also combine the code token sequence with the AST and observe improvement [23-25]. Our approach only needs the linearized AST and can be built upon the Transformer architecture. More importantly, it restricts the attention range and makes it possible to encode very long AST sequences.

Tree-based Neural Networks. Existing tree-based neural networks can be grouped into two categories depending on their inputs. (1) Models that directly take the tree as input [15, 31, 34, 47]: these models are strongly coupled with the tree structure, and their computation has to be performed along with the tree traversal; since trees generally have different shapes, parallelizing the training of such models is non-trivial. (2) Models that take sequence(s) extracted from the tree as input, such as sampled paths in the tree [4, 21], the traversal sequence with tree positional embeddings [42], or the structure-based traversal (SBT) sequence [18]. Taking sampled paths as input introduces a certain degree of randomness and instability, while tree positional embeddings ignore the concept of paths in the tree (all nodes, even unrelated ones, participate in the computation together). Our method improves on both: it guarantees that each node exchanges messages on, and only on, the paths containing it.

9 CONCLUSION

In this paper, we present AST-Trans, which encodes ASTs effectively for code summarization. In AST-Trans, each node only pays attention to the nodes that share an ancestor-descendant or sibling relationship with it. This brings two benefits: (1) the model is given an inductive bias and will not get lost in the overlong AST sequence, and (2) the computational complexity is reduced from quadratic to linear. The latter makes it possible to encode long code sequences, e.g., a whole file, which is prohibitively expensive for standard Transformers. We conduct comprehensive experiments showing that AST-Trans achieves state-of-the-art results on two popular benchmarks while significantly reducing the computational cost.

We believe the basic idea of AST-Trans can also be applied to other structured data such as data dependence and control flow graphs. The code is made publicly available to benefit relevant research. In future work, we plan to improve AST-Trans by incorporating more features of the code snippet, such as the API sequence and node type, into the self-attention mechanism.

10 ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (61802167, 61802095), the Natural Science Foundation of Jiangsu Province (No. BK20201250), the Cooperation Fund of the Huawei-NJU Creative Laboratory for the Next Programming, and NSF award 2034508. We thank Alibaba Cloud for its high-efficiency AI computing service from the EFlops Cluster. We also thank the reviewers for their helpful comments. Chuanyi Li and Jidong Ge are the corresponding authors.

REFERENCES
[1] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A Transformer-based Approach for Source Code Summarization. In ACL 2020.
[2] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In ICLR 2018.
[3] Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In ICML 2016.
[4] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating Sequences from Structured Representations of Code. In ICLR 2019.
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR 2015.
[6] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
[7] Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. CoRR abs/2004.05150.
[8] Ernie Chang, Xiaoyu Shen, Hui-Syuan Yeh, and Vera Demberg. 2021. On Training Instance Selection for Few-Shot Neural Text Generation. In ACL-IJCNLP 2021.
[9] Jianbo Dong, Zheng Cao, Tao Zhang, Jianxi Ye, Shaochuang Wang, Fei Feng, Li Zhao, Xiaoyong Liu, Liuyihan Song, Liwei Peng, et al. 2020. EFLOPS: Algorithm and System Co-design for a High Performance Distributed Training Platform. In HPCA 2020.
[10] Jianbo Dong, Shaochuang Wang, Fei Feng, Zheng Cao, Heng Pan, Lingbo Tang, Pengcheng Li, Hao Li, Qianyuan Ran, Yiqun Guo, et al. 2021. ACCL: Architecting Highly Scalable Distributed Training Systems with Highly-Efficient Collective Communication Library. IEEE Micro.
[11] Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-Sequence Attentional Neural Machine Translation. In ACL 2016.
[12] Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Structured Neural Summarization. In ICLR 2019.
[13] Sonia Haiduc, Jairo Aponte, and Andrian Marcus. 2010. Supporting Program Comprehension with Source Code Summarization. In ICSE 2010.
[14] Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the Use of Automated Text Summarization Techniques for Summarizing Source Code. In WCRE 2010.
[15] Jacob Harer, Christopher P. Reale, and Peter Chin. 2019. Tree-Transformer: A Transformer-Based Method for Correction of Tree-Structured Data. CoRR abs/1908.00449.
[16] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. In ICLR 2021.
[17] Dan Hendrycks and Kevin Gimpel. 2016. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. CoRR abs/1606.08415.
[18] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep Code Comment Generation. In ICPC 2018.


[19] Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018. Summarizing Source Code with Transferred API Knowledge. In IJCAI 2018.
[20] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In ACL 2016.
[21] Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. 2021. Code Prediction by Feeding Trees to Transformers. In ICSE 2021.
[22] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR 2017.
[23] Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved Code Summarization via a Graph Neural Network. In ICPC 2020.
[24] Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A Neural Model for Generating Natural Language Summaries of Program Subroutines. In ICSE 2019.
[25] Boao Li, Meng Yan, Xin Xia, Xing Hu, Ge Li, and David Lo. 2020. DeepCommenter: A Deep Code Comment Generation Tool with Hybrid Lexical and Syntactical Information. In ESEC/FSE 2020.
[26] Yuding Liang and Kenny Qili Zhu. 2018. Automatic Generation of Text Descriptive Comments for Code Blocks. In AAAI 2018.
[27] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out.
[28] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In ICLR 2019.
[29] Paul W. McBurney and Collin McMillan. 2016. Automatic Source Code Summarization of Context for Java Methods. IEEE Transactions on Software Engineering 42, 2.
[30] Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori L. Pollock, and K. Vijay-Shanker. 2013. Automatic Generation of Natural Language Summaries for Java Classes. In ICPC 2013.
[31] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In AAAI 2016.
[32] Genevieve B. Orr and Klaus-Robert Müller (Eds.). 1998. Neural Networks: Tricks of the Trade. Springer.
[33] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In ACL 2002.
[34] Jordan B. Pollack. 1990. Recursive Distributed Representations. Artificial Intelligence 46, 1-2.
[35] Heinz Prüfer. 1918. Neuer Beweis eines Satzes über Permutationen. Archiv der Mathematik und Physik 27.
[36] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In NAACL-HLT 2018.
[37] Xiaoyu Shen, Youssef Oualil, Clayton Greenberg, Mittul Singh, and Dietrich Klakow. 2017. Estimation of Gap Between Current Language Models and Human Performance. In Interspeech 2017.
[38] Xiaoyu Shen, Jun Suzuki, Kentaro Inui, Hui Su, Dietrich Klakow, and Satoshi Sekine. 2019. Select and Attend: Towards Controllable Content Selection in Text Generation. In EMNLP-IJCNLP 2019.
[39] Xiaoyu Shen, Yang Zhao, Hui Su, and Dietrich Klakow. 2019. Improving Latent Alignment in Text Summarization by Generalizing the Pointer Generator. In EMNLP-IJCNLP 2019.
[40] Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjing Wang, and Yu Sun. 2021. Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification. In IJCAI 2021.
[41] Yusuke Shido, Yasuaki Kobayashi, Akihiro Yamamoto, Atsushi Miyamoto, and Tadayuki Matsumura. 2019. Automatic Source Code Summarization with Extended Tree-LSTM. In IJCNN 2019.
[42] Vighnesh Leonardo Shiv and Chris Quirk. 2019. Novel Positional Encodings to Enable Tree-Based Transformers. In NeurIPS 2019.
[43] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori L. Pollock, and K. Vijay-Shanker. 2010. Towards Automatically Generating Summary Comments for Java Methods. In ASE 2010.
[44] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1.
[45] Hui Su, Xiaoyu Shen, Zhou Xiao, Zheng Zhang, Ernie Chang, Cheng Zhang, Cheng Niu, and Jie Zhou. 2020. MovieChats: Chat Like Humans in a Closed Domain. In EMNLP 2020.
[46] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In CVPR 2016.
[47] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In ACL 2015.
[48] Ze Tang, Chuanyi Li, Jidong Ge, Xiaoyu Shen, Zheling Zhu, and Bin Luo. 2021. AST-Transformer: Encoding Abstract Syntax Trees Efficiently for Code Summarization. arXiv preprint arXiv:2112.01184.
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In NeurIPS 2017.
[50] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR 2018.
[51] Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving Automatic Source Code Summarization via Deep Reinforcement Learning. In ASE 2018.


[52] Wenhua Wang, Yuqun Zhang, Zhengran Zeng, and Guandong Xu. 2020. TranS^3: A Transformer-based Framework for Unifying Code Summarization and Code Search. CoRR abs/2003.03238.
[53] Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, and Zhi Jin. 2019. Code Generation as a Dual Task of Code Summarization. In NeurIPS 2019.
[54] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for Longer Sequences. In NeurIPS 2020.
[55] Yang Zhao, Xiaoyu Shen, Wei Bi, and Akiko Aizawa. 2019. Unsupervised Rewriter for Multi-Sentence Compression. In ACL 2019.
[56] Yuxiang Zhu and Minxue Pan. 2019. Automatic Code Summarization: A Systematic Literature Review. CoRR abs/1909.04352.
[57] Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Günnemann. 2021. Language-Agnostic Representation Learning of Source Code from Structure and Context. In ICLR 2021.

A ALGORITHM OF GDC

Algorithm 1 Self-Attention with the Relationship Matrix
Input: Hidden state H, the COO format of the relationship matrix (COO_row, COO_col), content projections Q, K, V, and relative-distance projection matrices Q_P, K_P, V_P.
1: K_c = K(H), Q_c = Q(H), V_c = V(H)
2: for i = 0, ..., 2P + 1 do
3:   for j = 0, ..., N - 1 do
4:     Q̃_c[i; j; :] = Q_c[COO_col[i * N + j]; :]
5:     K̃_c[i; j; :] = K_c[COO_row[i * N + j]; :]
6:     Ṽ_c[i; j; :] = V_c[COO_row[i * N + j]; :]
7:   end for
8: end for
9: α̃ = (Q̃_c + Q_P) ⊙ (K̃_c + K_P) - Q_P ⊙ K_P
10: α̃ = exp(α̃ / √(3d))
11: for i = 0, ..., 2P + 1 do
12:   for j = 0, ..., N - 1 do
13:     α̃_sum[:; COO_row[i * N + j]] += α̃[i, j]
14:   end for
15: end for
16: α̃ = α̃ / α̃_sum
17: for i = 0, ..., 2P + 1 do
18:   for j = 0, ..., N - 1 do
19:     õ[COO_row[i * N + j]; :] += (Ṽ_c[i; j; :] + V_P[i; :]) · α̃[i, j]
20:   end for
21: end for
Output: õ

For better re-implementation, we also show the algorithm of GDC (Algorithm 1). Lines 1-10 describe the attention-score computation: Q̃_c, K̃_c and Ṽ_c are reshaped to [2P + 1, N, d]. Note that the attention scores α̃ have a different shape from traditional attention scores, so we redesign the softmax function in lines 11-16: the attention scores belonging to the same query vector, distinguished by COO_row[i * N + j], are added together as α̃_sum, and the softmax is then formed by dividing α̃ by α̃_sum. Finally, in lines 17-21, the relative-distance bias V_P is added to the value context, which is then multiplied by the attention scores α̃ and accumulated into the output õ.
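The following is a minimal PyTorch sketch of the same gather/scatter pattern. It is not the released implementation: the relationship matrix is represented here as flat lists of (query index, key index, relative-distance bucket) triples rather than the exact COO layout above, and the function name, arguments and projection modules are illustrative assumptions.

import torch

def sparse_tree_attention(h, q_idx, k_idx, dist, q_proj, k_proj, v_proj,
                          q_pos, k_pos, v_pos):
    # h: [N, d] node states; q_idx, k_idx, dist: [M] long tensors listing the M
    # related (query, key) pairs and their relative-distance bucket.
    # q_proj/k_proj/v_proj: callable projections (e.g., torch.nn.Linear(d, d)).
    # q_pos/k_pos/v_pos: [R, d] relative-distance embedding tables.
    N, d = h.shape
    q_c, k_c, v_c = q_proj(h), k_proj(h), v_proj(h)       # content projections (line 1)

    # Gather the content vectors of every related pair (cf. lines 2-8).
    q_pair = q_c[q_idx] + q_pos[dist]                     # [M, d]
    k_pair = k_c[k_idx] + k_pos[dist]                     # [M, d]

    # Disentangled score without the position-position term, scaled as in lines 9-10.
    score = (q_pair * k_pair).sum(-1) - (q_pos[dist] * k_pos[dist]).sum(-1)
    score = torch.exp(score / (3 * d) ** 0.5)             # [M]; no max-subtraction for brevity

    # Scatter-softmax: normalize the scores that share the same query node (cf. lines 11-16).
    denom = torch.zeros(N, device=h.device).index_add_(0, q_idx, score)
    alpha = score / denom[q_idx]                          # [M]

    # Weighted sum of (value + relative-distance bias), scattered back per query (cf. lines 17-21).
    out = torch.zeros(N, d, device=h.device)
    out.index_add_(0, q_idx, alpha.unsqueeze(-1) * (v_c[k_idx] + v_pos[dist]))
    return out

Because both the normalization and the output use index_add_, the cost stays linear in the number of related node pairs, which is the point of the tree-structured attention.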

B THE INFLUENCE OF MODEL SELECTION STRATEGY

The results reported in the paper come from the model with the best BLEU score on the validation set. We additionally select two other models, with the best validation METEOR and ROUGE-L scores respectively, and evaluate them on the test set. The results in Table 9 show that the model selection strategy indeed influences the performance, which may explain why the improvement of AST-Trans is inconsistent across metrics.

Table 9: Comparison of AST-Trans with different model selection strategies on the Java dataset.

Model                           BLEU    METEOR   ROUGE-L
AST-Trans (best_eval_BLEU)      48.29   30.94    55.85
AST-Trans (best_eval_METEOR)    47.02   31.90    55.72
AST-Trans (best_eval_ROUGE-L)   46.92   29.99    57.01
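For completeness, the sketch below shows how such a per-metric selection could be scripted. The evaluate helper and the checkpoint list are hypothetical; the snippet only illustrates picking one checkpoint per validation metric and reporting its test scores, as in Table 9.

def select_and_test(checkpoints, evaluate):
    # evaluate(checkpoint, split) is assumed to return a dict such as
    # {"BLEU": ..., "METEOR": ..., "ROUGE-L": ...}; it is not part of the paper's code.
    metrics = ("BLEU", "METEOR", "ROUGE-L")
    results = {}
    for m in metrics:
        # Pick the checkpoint that maximizes metric m on the validation split ...
        best = max(checkpoints, key=lambda c: evaluate(c, "valid")[m])
        # ... and report all metrics for that checkpoint on the test split.
        results[f"best_eval_{m}"] = evaluate(best, "test")
    return results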