
Towards Summarizing Code Snippets Using Pre-Trained Transformers

Antonio Mastropaolo, SEART @ Software Institute, Università della Svizzera Italiana, Lugano, Switzerland
Matteo Ciniselli, SEART @ Software Institute, Università della Svizzera Italiana, Lugano, Switzerland
Luca Pascarella, Center for Project-Based Learning, ETH Zurich, Zurich, Switzerland
Rosalia Tufano, SEART @ Software Institute, Università della Svizzera Italiana, Lugano, Switzerland
Emad Aghajani, SEART @ Software Institute, Università della Svizzera Italiana, Lugano, Switzerland
Gabriele Bavota, SEART @ Software Institute, Università della Svizzera Italiana, Lugano, Switzerland

arXiv:2402.00519v1 [cs.SE] 1 Feb 2024

ABSTRACT

When comprehending code, a helping hand may come from the natural language comments documenting it that, unfortunately, are not always there. To support developers in such a scenario, several techniques have been presented to automatically generate natural language summaries for a given code. Most recent approaches exploit deep learning (DL) to automatically document classes or functions, while little effort has been devoted to more fine-grained documentation (e.g., documenting code snippets or even a single statement). Such a design choice is dictated by the availability of training data: For example, in the case of Java, it is easy to create datasets composed of pairs <method, javadoc> that can be fed to DL models to teach them how to summarize a method. Such a comment-to-code linking is instead non-trivial when it comes to inner comments documenting a few statements. In this work, we take all the steps needed to train a DL model to automatically document code snippets. First, we manually built a dataset featuring 6.6k comments that have been (i) classified based on their type (e.g., code summary, TODO), and (ii) linked to the code statements they document. Second, we used such a dataset to train a multi-task DL model taking as input a comment and being able to (i) classify whether it represents a "code summary" or not, and (ii) link it to the code statements it documents. Our model identifies code summaries with 84% accuracy and is able to link them to the documented lines of code with recall and precision higher than 80%. Third, we run this model on 10k projects, identifying and linking code summaries to the documented code. This unlocked the possibility of building a large-scale dataset of documented code snippets that have then been used to train a new DL model able to automatically document code snippets. A comparison with state-of-the-art baselines shows the superiority of the proposed approach, which, however, is still far from representing an accurate solution for snippet summarization.

CCS CONCEPTS

• Software and its engineering → Extra-functional properties.

KEYWORDS

software documentation

ACM Reference Format:
Antonio Mastropaolo, Matteo Ciniselli, Luca Pascarella, Rosalia Tufano, Emad Aghajani, and Gabriele Bavota. 2024. Towards Summarizing Code Snippets Using Pre-Trained Transformers. In Proceedings of the 32nd International Conference on Program Comprehension (ICPC 2024). ACM, New York, NY, USA, 12 pages. https://doi.org/XXXXXXX.XXXXXXX

ICPC 2024, April 2024, Lisbon, Portugal
© 2024 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06. $15.00
https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION

Empirical studies showed that code comprehension can take up to 70% of developers' time [42, 68]. While code comments can support developers in such a process [12], their availability [52] and consistency with the documented code [13, 14, 35] cannot be taken for granted. A helping hand may come from tools proposed in the literature to automatically document code [3, 5, 16, 19, 22, 26, 30, 39, 43, 48, 50, 53, 53, 66, 67, 70]. The most recent techniques (e.g., [19, 22, 30]) train deep learning (DL) models with the aim of learning how to summarize a given piece of code in natural language. This requires the building of a large-scale dataset composed of pairs <code, description> that can be used to feed the model with code instances, asking it to generate their description. These approaches are usually trained to work at function-level granularity. This means that, in the case of Java, methods are mined from open source projects and linked to the first sentence of their Javadoc, which is assumed to represent a plausible code summary.

Having such a granularity could be, however, suboptimal to support comprehension activities. Indeed, while the overall goal of a method might be clear to a developer, they may not understand a specific set of statements in it. Also, looking at the datasets used in the literature to train these models, we found that the methods' descriptions extracted from the Javadoc are usually very short. For example, the seminal dataset by LeClair and McMillan [32] features an average of 7.6 words (median=8.0) to summarize each Java method. While such short descriptions could provide a grasp about the overall goal of the method, it is unlikely that they can actually support a developer struggling to understand it.

For this reason, a few attempts have been made to automatically summarize code snippets rather than entire functions [3, 24, 48,
53, 60, 66, 67]. Most of them are based on information retrieval [3, 48, 66, 67], meaning that, given a code snippet CS to document, the most similar snippet to it is identified in a previously built dataset and its comments are reused to summarize CS. These approaches, while valuable, rely on manually crafted heuristics to automatically identify the "scope of an inner comment", i.e., the statements that a given comment documents. For example, one may assume that an //inline comment in Java documents all following statements until a blank line is found [8]. As we will show, such a heuristic fails in several cases. Other techniques [53, 60] exploit pre-defined templates to document code snippets that, however, cannot generalize to all combinations of code statements one could find.

Given the limitations of previous work, Huang et al. [24] proposed an approach exploiting reinforcement learning to document code snippets. The first challenge they faced was the creation of a training dataset. Indeed, while it is relatively easy to collect pairs of <code, description> when working at function-level granularity, this is not the case for code snippets. For this reason, Huang et al. exploited an approach proposed by Chen et al. [8] to automatically detect the scope of code comments. The approach exploits a combination of heuristics and learning-based techniques to automatically identify, given a comment, the set of statements documented by it. Using this approach, Huang et al. [24] built a dataset of ∼124k <snippet, description> pairs, which has been used to train RL-BlockCom, a DL model combining reinforcement learning with a classic encoder-decoder model. RL-BlockCom is able, given a code snippet as input, to automatically document it, reaching a BLEU-4 [44] of 24.28. While being the first DL-based approach to support code snippet summarization, RL-BlockCom suffers from some major limitations, mostly related to the way in which its training/test sets have been built exploiting the approach in [8]:

1. Simplified/unrealistic linking of code comments to the documented snippet [8]. This is due to some of the design choices made in the scope detection approach [8]. For example, the authors "regard the first out-of-scope statement as the demarcation point of the scope of the comment". This means that, according to their approach, it is not possible for a code comment to document non-contiguous statements. As we will show, our manual validation of 6,645 instances reveals 1,598 (∼27%) cases of code comments that document non-contiguous statements. These are all cases which cannot be successfully supported by the scope detection approach and, as a consequence, by RL-BlockCom.

2. Lack of filters to identify code summaries [8]. Chen et al. correctly observed that not all comments "describe" code statements. Thus, they use heuristics to remove commented-out code, TODO comments, IDE-generated comments, and non-text comments containing dates or links. Despite these filters, using such an approach to create a training dataset for a snippet summarization approach such as RL-BlockCom means feeding it with comments which may not be an actual code summary of the documented snippet. For example, when manually looking at the previously mentioned 6,645 instances, we found 33% of them to just act as a logical split of source code (i.e., a "formatting" comment [46]) without providing additional information on the documented code (e.g., a comment //get messages put on top of a method call getMessages()). These comments are useless to train a code summarizer, but are not excluded from the RL-BlockCom training dataset.

3. The training dataset used in RL-BlockCom includes code summaries as short as two words [24]. These are unlikely to be code summaries useful to support program comprehension.

To address these limitations, in this work we take all steps needed to foster the research on snippet summarization, as depicted in Fig. 1. First (step 1 in Fig. 1), we manually built a dataset of 6,645 <snippet, description> pairs, in which we classified the code comment (description) as being or not a code summary and linked it to the documented Java statements. Such a dataset has been built by ensuring two evaluators for each analyzed comment, with a third one solving conflicts when needed. The overall effort spent by the six involved authors accounts for 815 man-hours.

We use this dataset to fine-tune SALOON (step 2 in Fig. 1), a multi-task pre-trained Text-to-Text-Transfer-Transformer (T5) [47] model able to take as input an inner comment in a method and (i) classify whether it represents a valid code summary with 83% accuracy; and (ii) link it to the relevant code snippets it documents with a recall/precision higher than 80%. We show that the performance of SALOON is significantly better than the comment-to-code linking approach by Chen et al. [8].

Finally (step 3 in Fig. 1), we run SALOON on 10k GitHub Java projects to automatically build a large-scale dataset of ∼554k <snippet, description> pairs. The latter has been used to train and test STUNT, a DL-based approach taking as input a code snippet and automatically generating its code summary. We show that STUNT performs better than IR-based baselines and the RL-based RL-BlockCom. Despite this finding, our results also show that STUNT is not yet ready to be deployed to developers and point to more research being needed on the task of snippet summarization.

In summary, our contributions are: (i) the largest manually built dataset in the literature featuring classified and linked code comments; (ii) SALOON, a multi-task DL model able to achieve state-of-the-art performance in the tasks of comment classification and linking; and (iii) STUNT, a code snippet summarization model trained on a large-scale and more realistic dataset as compared to the one used in the literature [24]. The dataset and all code used to train and test the models in this paper are available in our replication package [7].

2 BUILDING A DATASET OF DOCUMENTED CODE SNIPPETS

We detail the process used to build a manually validated dataset featuring triplets <D, {CC}, DC>, where D represents a natural language comment documenting the code snippet DC (Documented Code) and {CC} represents the Comment Category (e.g., code summary, TODO comment), with more than one category possibly being assigned to the comment. We later use such a dataset to train and evaluate the model described in Section 3, taking as input a comment D and automatically (i) classifying it, thus being able to check whether D is a code summary (i.e., an actual description of the documented code) or another type of comment (e.g., TODOs), and (ii) linking D to the corresponding documented code DC.
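To make the triplet structure concrete, the following minimal sketch (our own illustration, not code from the replication package) models one labeled instance; field and category names are assumptions chosen for readability.

```python
from dataclasses import dataclass

# Categories used in the manual labeling (Section 2.1): Pascarella et al.'s
# taxonomy plus the two categories added by the authors (orphan, code example).
CATEGORIES = {
    "summary", "rationale", "deprecation", "usage", "exception", "todo",
    "incomplete", "commented_code", "formatter", "pointer",
    "orphan", "code_example",
}

@dataclass
class LabeledComment:
    """One manually labeled triplet <D, {CC}, DC>."""
    comment: str                 # D: the natural language comment
    categories: set[str]         # {CC}: one or more comment categories
    documented_lines: list[int]  # DC: documented statements (line numbers);
                                 # possibly non-contiguous, possibly empty

    def is_code_summary(self) -> bool:
        return "summary" in self.categories

# Example: a summary comment documenting non-contiguous lines 11, 12, and 17.
example = LabeledComment(
    comment="// get the messages and notify the listeners",
    categories={"summary"},
    documented_lines=[11, 12, 17],
)
```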
[Figure 1: Approach Overview. Three steps: (1) creating a dataset of documented code snippets — comments collected from 1.5k Java files are labeled by two evaluators, with a third reviewer solving conflicts; (2) SALOON — a technique for the automated classification of comments and their linking to the documented code statements; (3) STUNT — automated generation of code snippet summaries, trained on the mined 555k code-comment pairs.]

2.1 Study Design

As a first step to build our dataset, we needed to collect the set of code comments D1, D2, ..., Dn to manually analyze. To collect these comments, we used the web application by Dabic et al. [11] to query GitHub for all Java projects having at least 500 commits, 25 contributors, 10 stars, and not being forks. These filters aim at discarding personal/toy projects and reducing the chance of mining duplicated code. The focus on Java was dictated by the will of accommodating the expertise of the manual validators (i.e., the authors), all having extensive knowledge of the Java programming language. Despite the focus on Java, our methodology to build the dataset as well as to train the models described in the subsequent sections is general and can be reproduced for different languages. We randomly cloned 100 of the 1,681 projects resulting from our search on GitHub, for a total of ∼768k Java files.

We parsed their code to identify comments within each method to manually analyze. We ignored Javadoc comments since they document entire methods rather than code snippets: We only considered single-line (starting with "//") and multi-line (starting with "/*") comments as subject of our manual analysis. Also, we did not extract comments from test methods (i.e., methods annotated with @Test) to increase the cohesiveness of our dataset and only focus on documentation related to production code. The manual analysis has been performed by the six authors (from now on, evaluators) through a web app we developed to support the process.

We targeted the labeling of valid comments (i.e., excluding those removed by the above-described procedure) within 1,500 Java files, with the idea of creating a dataset of ∼10k triplets (<D, {CC}, DC>). The web app assigned each Java file to two evaluators who independently labeled the comments in it. If the number of comments in a file was higher than 10, the web app randomly selected a number of comments to label going from 10 to m, where m was the actual number of valid comments in the file. Otherwise, all comments in the file were labeled. We opted for this process to avoid an evaluator being stuck for too much time on a single file. Also, we did not consider comments belonging to methods longer than 1,024 tokens and made sure no duplicated methods were present in the final dataset (the same method might be present across different files/projects). The filter on the method length was driven by the final usage we envision for our dataset, namely training DL models, which usually work on inputs of limited size (≤512 tokens, or even less, see e.g., [19, 36, 37, 54–56]). Thus, labeling instances longer than 1,024 tokens would have been a waste of resources.
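As an illustration of the comment-extraction step described above, the sketch below collects // and /* */ comments while skipping Javadoc blocks. It is a simplified, regex-based approximation for illustration only, not the parser actually used by the authors.

```python
import re

# Simplified approximation of the extraction step in Section 2.1: keep
# single-line (//) and multi-line (/* */) comments, skip Javadoc (/** */).
# A real implementation would rely on a proper Java parser.
SINGLE_LINE = re.compile(r"//[^\n]*")
MULTI_LINE = re.compile(r"/\*(?!\*).*?\*/", re.DOTALL)  # excludes /** Javadoc */

def extract_inner_comments(method_source: str) -> list[str]:
    comments = MULTI_LINE.findall(method_source)
    comments += SINGLE_LINE.findall(method_source)
    return [c.strip() for c in comments]

method = """
public List<String> getMessages() {
    // get the messages
    List<String> msgs = repository.load();
    /* filter out empty entries */
    msgs.removeIf(String::isEmpty);
    return msgs;
}
"""
print(extract_inner_comments(method))
# ['/* filter out empty entries */', '// get the messages']
```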
The goal of the labeling was to firstly assign the comment D to one or more categories CCs. The starting set of categories to use was taken from the work by Pascarella et al. [46] and included: summary, rationale, deprecation, usage, exception, TODO, incomplete, commented code, formatter, and pointer. We do not describe these categories due to the lack of space, pointing the reader to [46] for a complete description. However, as concrete examples, summary represents the classic code description explaining what the code is about, formatter is a comment used by developers to better organize the code into logical sections, while pointer refers to comments linking external resources. We excluded from the original list by Pascarella et al. [46] the following categories: (i) directive and autogenerated since, as described by the authors, they both concern comments automatically generated by the IDE; and (ii) license and ownership, since this information is usually featured in Javadoc comments.

Finally, we merged the expand category into summary, since the former is defined by the authors as a code description providing more information than a usual summary. Such a distinction is irrelevant for our work. Besides the set of predefined categories, we also gave the possibility to evaluators to define new categories. If an evaluator defined a new category, it was immediately visible to all other evaluators. The following additional categories have been defined by us: orphan, indicating a code comment not linked to any line of code, and code example, indicating a comment describing, e.g., how to invoke a specific method.

Once the category for a given comment under analysis was defined, the next step was the linking of the comment to the documented code DC. The linking has been performed at line-level granularity. This means, for example, that for a comment D the evaluator could indicate lines 11, 12, and 17 as documented. Note that gaps are possible in DC (i.e., the documented code could be composed of non-contiguous lines). Our replication package [7] shows concrete examples of this scenario, which we omit here due to space limitations. Then, we started resolving conflicts arising from the manual analysis. Two types of conflicts are possible for each manually defined triplet <D, {CC}, DC>: The two evaluators could have (i) selected a different set {CC} when classifying the comment; and (ii) identified different sets of lines (DC) documented by the comment. Out of the 6,645 manually labeled comments, 1,395 (21%) resulted in a conflict: 1,144 were due to different comment categories selected by the evaluators; 47 to differences in the selected DC; 204 concerned both the categories and the DC. Conflicts were solved by a third evaluator not involved in the labeling of the conflicting instance.
Overall, we spent 815 man-hours on the labeling and conflict resolution, manually annotating 6,645 comments (with two evaluators for each of them) coming from 1,508 Java files and 85 software projects. We labeled a bit more than the targeted 1,500 files since multiple evaluators were working in parallel without noticing that we had hit our target. The obtained dataset, publicly available in our replication package [7], is briefly described in the following.

2.2 Dataset

Table 1: Dataset output of manual labeling

Category          #Instances   Documented Statements (mean / median / sd)
Summary           3,841        3.40 / 3.0 / 2.70
Formatting        2,209        2.32 / 2.0 / 2.65
Rationale         983          3.04 / 2.0 / 2.74
TODO              258          0.46 / 0.0 / 1.16
Commented Code    184          0.00 / 0.0 / 0.00
Pointer           33           2.66 / 2.0 / 5.27
Orphan            29           0.00 / 0.0 / 0.00
Code Example      9            1.77 / 2.5 / 1.48
Deprecation       7            3.14 / 3.0 / 1.34
Incomplete        2            1.50 / 1.5 / 0.70
Overall           6,645        1.83 / 1.60 / 1.80

Table 1 summarizes the dataset obtained as output of our analysis. We excluded from the table the categories for which we did not find any instance (e.g., exception [46], likely to be more prevalent in Javadoc comments). Since a single comment can be associated with multiple categories (e.g., summary and rationale), the sum of the "#Instances" column does not add up to the total number of comments we manually classified (i.e., 6,645).

Besides reporting the categories to which the comments in our dataset belong, Table 1 also shows descriptive statistics related to the number of statements documented by comments belonging to different categories. As expected, orphan and commented code comments are not linked to any code statement. More than 80% of TODO comments are also not linked to any statement, since in many cases TODOs are related to, e.g., features that must be implemented. Similarly, the only two incomplete comments we found are both not linked to any code: These are partially written comments needing rework.

The most frequent category is, as expected, the summary one (3,841 instances), grouping comments summarizing one or more code statements (on average, 3.40 statements). Another popular category is "formatting", with 2,209 instances. While one could expect no code linked to formatting comments, this is actually not the case, since we used such a category also for comments not adding new information to the documented code but just acting as a logical split of the code (e.g., a comment //get messages put on top of a method call getMessages()).

Finally, comments explaining the rationale for implementation choices account for 983 instances. While we focus on the generation of code summaries, these instances often contain interesting information that is hard to automatically synthesize and could represent a seed for future research.
Interestingly, 1,598 of the comments in our dataset (∼27%) include "gaps" in the linked code. This means, for example, that a comment documents lines 11, 12, and 17 (but not lines 13-16) — see [7] for concrete examples. This means that approaches to automatically link comment and code must take such a scenario into account. Motivated by these insights, we fill this gap by creating a novel method for classifying and linking code comments, as elucidated in Section 3.

3 AUTOMATIC CLASSIFICATION OF CODE COMMENTS AND LINKAGE TO DOCUMENTED CODE

We start by presenting SALOON (claSsification And Linking Of cOmmeNts), the approach we devised for the classification of code comments and their linking to the documented code (Section 3.1). Then, we discuss the design of the study we run to assess its accuracy (Section 3.2) and the achieved results (Section 3.3). Once trained, SALOON can be run on hundreds of projects to build a large-scale dataset featuring classified and linked code comments. While we could just refer to SALOON as a "T5 model trained for comment classification and linking", we preferred to name it to simplify the reading when we introduce the other T5 model we train for the task of code summarization (Section 4).

3.1 Approach Description

SALOON is built on top of T5, a DL transformer-based model [47]. T5 has been presented by Raffel et al. [47] as a model that can be trained to support any Natural Language Processing (NLP) task that can be represented in a text-to-text format, meaning that both the input and the output of the model are text strings. Such a representation is well-suited for code-related tasks, as demonstrated by the recent literature (see e.g., [38, 56, 62]).

Raffel et al. [47] reported state-of-the-art results for several NLP benchmarks, especially when leveraging the "pretrain-then-finetune" paradigm: The model is first pre-trained on a large dataset with the goal of learning patterns about the underlying language of interest (e.g., Java). Then, it is fine-tuned to learn a specific task of interest (e.g., code summarization). The pre-training is performed using self-supervised pre-training objectives such as the masked language model. The idea is to provide the model with input sentences (e.g., Java methods) in which a percentage of randomly selected tokens has been masked, with the model in charge of guessing them. This prepares the model's weights for the fine-tuning, in which tailored datasets are used to teach the model the specific task to support (e.g., pairs of code and comments). The pre-training phase is particularly important when the dataset used for the fine-tuning is expensive to build (i.e., it requires manual validation) and, as a consequence, is limited in size. This is the case for our work, since our fine-tuning is performed on the dataset described in Section 2, in which comments have been categorized and linked to the relevant statements.

In SALOON, we exploit the T5small architecture described by Raffel et al. [47]. Due to space constraints, we point the reader to the original paper for all architectural details. In the following, we describe how we built the pre-training and fine-tuning datasets for the tasks of comment classification and linking.
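A minimal sketch of the masked-token objective described above: randomly selected tokens are replaced by sentinel tokens and the model must reconstruct them. The T5 objective actually corrupts spans; this single-token variant, the whitespace tokenization, and the sentinel spelling are simplifications for illustration only.

```python
import random

def mask_for_pretraining(instance: str, mask_rate: float = 0.15, seed: int = 0):
    """Build a (corrupted input, target) pair in the T5 style: randomly selected
    tokens are replaced by sentinel tokens and must be predicted by the model."""
    rng = random.Random(seed)
    tokens = instance.split()  # plain whitespace split, only for illustration
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))

    target = []
    for sentinel_id, pos in enumerate(positions):
        target.append(f"<extra_id_{sentinel_id}> {tokens[pos]}")
        tokens[pos] = f"<extra_id_{sentinel_id}>"
    return " ".join(tokens), " ".join(target)

corrupted, target = mask_for_pretraining(
    "public int sum ( int a , int b ) { return a + b ; }")
print(corrupted)  # the method with ~15% of tokens replaced by <extra_id_N> sentinels
print(target)     # the masked tokens, each preceded by its sentinel
```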
3.1.1 Pre-training Dataset. We start from the Java CodeSearchNet dataset [25], which features ∼1.6M Java methods, ∼499k of which include a Javadoc. Given the tasks we aim at supporting (i.e., automatic classification of code comments and linking to the code they document), there are two "target languages" we aim to expose to T5 during pre-training: Java code and technical natural language in the form of code comments. CodeSearchNet features both of them. We preprocess the dataset by discarding all instances having #tokens > 1,024. During pre-training we treat Java methods and Javadoc comments as separate instances (i.e., we ignore their association), thus removing Java methods and Javadoc comments longer than 1,024 tokens. Such a filter removed ∼32k instances (i.e., 31,702 methods and 178 Javadoc comments). Then, we excluded instances containing non-ASCII characters as well as Javadoc comments composed of less than 5 tokens (words), since they are unlikely to represent meaningful code descriptions (∼57k instances removed). After removing duplicates, we end up with 1,870,888 pre-training instances (1,501,013 Java methods and 369,875 Javadoc comments).

3.1.2 Fine-tuning Dataset. Two fine-tuning datasets are needed to support the tasks we target (i.e., comment classification and linking). For comment classification, we built a dataset composed of pairs <Mj,Di, Cc>, in which a specific inner comment Di within a method Mj is linked to a category Cc classifying it (e.g., code summary). For comment-to-code linking, we built a dataset featuring pairs <Mj,Di, DC>, in which DC reports the Mj's statements documented by Di. Both datasets have been extracted from the manually built dataset of 6,645 classified and linked comments (Section 2).

Comment classification. Given the goal of our work (i.e., summarizing code snippets), we are interested in automatically identifying comments we classified as code summary while excluding all the others. Starting from the dataset in Table 1, we extracted 3,841 <Mj,Di, Cc> pairs having Cc = code summary and 2,921 having Cc = other. Basically, we target the training of a binary classifier taking as input a code comment (Di) in the context of the method it belongs to (Mj) and guessing whether it is a code summary or not. The specific input we provide to T5 is Mj's code with special tokens <comment></comment> surrounding the comment of interest (this is the representation of Mj,Di), and we expect as output either "code summary" or "other" (i.e., Cc). Differently from the pre-training dataset, we did not need to remove sequences longer than 1,024 tokens, since this has already been done in the first place during the building of the dataset described in Section 2. We randomly split the dataset into 80% training, 10% evaluation, and 10% test. The first row in Table 2 shows the number of instances in these three sets.

Code Linking. Concerning the task of linking comments to code snippets, our training instances are only those comments that we manually labelled as code summary. Indeed, we are interested in linking this specific type of comment to its code. Thus, we start from the 3,841 code summary instances to build the needed <Mj,Di, DC> pairs. The representation of Mj,Di is similar to the one previously discussed for the comment classification dataset (i.e., the method Mj with special tags surrounding the inner comment of interest Di), with the only difference being a special tag <N> preceding each statement and reporting its line number in an incremental fashion. As for the expected output DC (i.e., the documented code), it is represented as a stream of "<N>" tags reporting the line numbers (i.e., statements) within Mj linked to Di (e.g., <1><2><4>). Such a representation allows marking non-contiguous statements documented by Di. The code linking fine-tuning dataset is composed of 3,841 instances split into 80% training, 10% evaluation, and 10% test, as shown in the second row of Table 2. Note that, to ensure a fair evaluation of the proposed approach, we split the dataset by taking into consideration the Java class from which these methods were originally extracted.

Table 2: Fine-tuning datasets

Task                       Train    Eval   Test
Comment Classification     4,833    726    1,203
Code Linking               2,805    403    633
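To make the two representations concrete, the sketch below builds a classification instance and a linking instance for a toy method. The exact tag spelling, spacing, and whether the comment line itself receives a <N> tag are assumptions based on the description above; the actual preprocessing scripts are in the replication package [7].

```python
def build_classification_instance(method_lines, comment_line_idx):
    """Input for comment classification: the method with the comment of interest
    wrapped in <comment></comment>. Target: "code summary" or "other"."""
    tagged = [line.strip() for line in method_lines]
    tagged[comment_line_idx] = f"<comment> {tagged[comment_line_idx]} </comment>"
    return " ".join(tagged)

def build_linking_instance(method_lines, comment_line_idx, documented_lines):
    """Input for code linking: same as above, but each statement is preceded by a
    <N> tag with its incremental line number. Target: the stream of <N> tags of
    the documented statements, e.g. "<1><2><4>"."""
    tagged = []
    for n, line in enumerate(method_lines, start=1):
        text = line.strip()
        if n - 1 == comment_line_idx:
            text = f"<comment> {text} </comment>"
        tagged.append(f"<{n}> {text}")
    source = " ".join(tagged)
    target = "".join(f"<{n}>" for n in sorted(documented_lines))
    return source, target

method = [
    "public void notifyAll(List<Listener> listeners) {",
    "// get the messages and send them to each listener",
    "List<String> msgs = getMessages();",
    "msgs.removeIf(String::isEmpty);",
    "listeners.forEach(l -> l.onMessages(msgs));",
    "}",
]
src, tgt = build_linking_instance(method, comment_line_idx=1, documented_lines=[3, 5])
print(tgt)  # <3><5>  (non-contiguous statements can be marked)
```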
3.1.3 Training Procedure and Hyperparameters Tuning. We evaluated the performance of eight T5 models (four pre-trained and four non pre-trained) on the evaluation set of each task in terms of correct predictions, namely cases in which the generated output (i.e., the comment category or the documented statements) was identical to the expected output.

We pre-train the T5 model from scratch (i.e., starting from random weights) rather than starting from already pre-trained models for code such as CodeT5 [63], which is based on the same architecture proposed by Raffel et al. [47] that we exploit in our investigation. Our decision is primarily motivated by the desire to have a model pre-trained on a single programming language (Java) as opposed to a multi-language model (as CodeT5).

We pre-train T5 for 300k steps using a 2x2 TPU topology (8 cores) from Google Colab with a batch size of 16. During pre-training, we randomly mask 15% of the tokens in an instance (i.e., a Java method or Javadoc comment), asking the model to guess the masked tokens. To avoid over-fitting, we monitored the loss function every 10k steps and stopped the training if its value did not improve after 12 consecutive evaluations (i.e., after 120k steps, one epoch on our pre-training dataset). We use the canonical T5small configuration [47] during pre-training. We also used the pre-training dataset to train a SentencePiece model (i.e., a tokenizer for neural text processing) with vocabulary size set to 32k word pieces.

We fine-tuned a pre-trained and a non pre-trained model, experimenting with four different learning rate schedulers (thus leading to eight overall trained models). Constant Learning Rate (C-LR) fixes the learning rate during the whole training; Inverse Square Root Learning Rate (ISR-LR), in which the learning rate decays as the inverse square root of the training step; Slanted Triangular Learning Rate (ST-LR), in which the learning rate first linearly increases and then linearly decays to the starting value; and Polynomial Decay Learning Rate (PD-LR), having the learning rate decaying polynomially from an initial value to an ending value in the given decay steps. The parameters used for the learning rates are available in [7].
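For reference, the four schedules can be summarized as follows. These are standard formulations; all concrete constants (initial/peak values, warm-up, decay steps, power) are placeholders, since the parameters actually used are only listed in the replication package [7].

```python
def constant_lr(step, lr=1e-3):                                     # C-LR
    return lr

def inverse_square_root_lr(step, peak=1e-2, warmup=10_000):         # ISR-LR
    return peak / max(step, warmup) ** 0.5

def slanted_triangular_lr(step, start=1e-4, peak=1e-2,
                          total=75_000, frac_up=0.1):               # ST-LR
    cut = int(total * frac_up)
    if step <= cut:  # linear increase up to the peak...
        return start + (peak - start) * step / cut
    return peak - (peak - start) * (step - cut) / (total - cut)      # ...then linear decay

def polynomial_decay_lr(step, start=1e-2, end=1e-4,
                        decay_steps=75_000, power=2.0):             # PD-LR
    step = min(step, decay_steps)
    return (start - end) * (1 - step / decay_steps) ** power + end
```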
We fine-tuned each of the eight models for a total of 75k steps on the fine-tuning training set of each task. We include in our replication package [7] a table showing the percentage of correct predictions (for the comment classification task) and the precision and recall (for the code linking task) achieved by each of the pre-trained and non pre-trained models on the evaluation sets. Overall, the pre-trained models work substantially better, especially when it comes to the code linking task. In particular, in their respective best configuration, pre-trained models achieve (i) a 75% classification accuracy in the comment classification task, as compared to the 58% of the non pre-trained models; and (ii) 85% precision and 89% recall in the code linking task, as compared to the 53% precision and 67% recall of the non pre-trained models. Such a result is expected considering that the fine-tuning training datasets are quite small due to the substantial manual effort required to build them (∼6.7k instances for comment classification and ∼3.8k for code linking). Having small fine-tuning datasets is the scenario in which pre-training is known to bring major benefits [49]. As for the learning rate, the best results are achieved with ISR-LR when pre-training and with PD-LR when not pre-training.

To obtain the final model to use in SALOON, we fine-tuned the best performing model (i.e., pre-trained with ISR-LR) using an early-stopping strategy in which we evaluated the model on the evaluation sets every 5k steps, stopping when no improvements were observed for 5 consecutive evaluations. We discuss the results achieved by SALOON as compared to other baselines in Section 3.3.

3.2 Study Design

The goal of the study is to assess the accuracy of SALOON in the two tasks it has been trained for: comment classification and code linking. The context is represented by the test sets reported in Table 2, featuring 1,203 instances for the task of comment classification and 633 for the task of code linking.

Concerning the comment classification task, we do not compare SALOON against any baseline, since our goal (i.e., identifying only code summaries) is quite specific to our work. Instead, we compare the performance of SALOON against the three following baselines for the task of code linking (the implementation of all baselines is publicly available [7]).

Heuristic-1: blank line [8]. The first baseline is a straightforward heuristic assuming that a given //inline comment documents all following statements until a blank line is reached.

Heuristic-2: token-based string similarity [13]. The basic idea of this heuristic is that statements sharing terms with a code comment are more likely to be documented by it. We use the token-based string similarity by Fluri et al. [13] to compute the textual similarity between each comment in the test set and all statements in the method it belongs to. A statement is linked to the comment if its similarity with it is higher than or equal to a threshold λ. The similarity is computed as the percentage of overlapping terms between the two strings (i.e., comment and statement), with the terms being extracted through space splitting. We experiment with different values for λ, going from 0.1 (i.e., 10% of terms are shared between the two strings) to 0.9 at steps of 0.1 (a simplified sketch of both heuristics is shown after the baseline descriptions).
ML-based solution [8]. The approach by Chen et al. [8] relies on the random forest machine learning algorithm to classify statements in a method as linked or not to a given comment. Unfortunately, the source code of such approach is not available and, thus, we had to reimplement it following the description in the corresponding article. In a nutshell, the approach works as follows. The random forest uses three families of features to characterize a given statement and classify it as linked or not to a given comment. The first family comprises eight "code features", capturing characteristics of the statement, such as the statement type (e.g., if, for) and whether the statement shares method calls with the statements preceding and following it. The second family includes four "comment features", focusing on characteristics of the comment of interest, such as its length and the number of verbs/nouns it contains. Finally, the third family groups four "relationship features", representing the relationship between the comment and the statement (e.g., textual similarity). For a fair comparison, we train the random forest on the same training set used for SALOON.
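The sketch below illustrates the two heuristic baselines described above. It follows their textual descriptions (space-split tokens; the overlap here is normalized by the union of the two token sets, one possible reading of "percentage of overlapping terms") and is not the authors' implementation, which is available in the replication package [7].

```python
def blank_line_scope(method_lines, comment_idx):
    """Heuristic-1: the comment documents all following statements
    until a blank line is reached."""
    linked = []
    for i in range(comment_idx + 1, len(method_lines)):
        if method_lines[i].strip() == "":
            break
        linked.append(i)
    return linked

def token_similarity_scope(method_lines, comment_idx, lam=0.1):
    """Heuristic-2: link every statement whose token overlap with the
    comment is >= the threshold lambda."""
    comment_tokens = set(method_lines[comment_idx].lower().split())
    linked = []
    for i, line in enumerate(method_lines):
        if i == comment_idx or not line.strip():
            continue
        stmt_tokens = set(line.lower().split())
        union = comment_tokens | stmt_tokens
        overlap = len(comment_tokens & stmt_tokens) / len(union) if union else 0.0
        if overlap >= lam:
            linked.append(i)
    return linked
```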
3.2.1 Data Collection And Analysis. Concerning the comment classification task, we run SALOON on the test set and report the accuracy of the model in classifying comments representing "code summaries". As for the code linking, we start by computing the percentage of correct predictions, namely cases in which all statements linked to a comment in the test set match the ones in the oracle. This means that a comment instance correctly linked to two out of the three statements it documents is considered wrong. We also compute the recall and precision of the techniques at statement level. The recall is computed as TP/(TP+FN), where TP represents the set of code-to-comment links correctly identified by a technique (i.e., a statement correctly linked to a comment) and FN is the set of correct code-to-comment links in the oracle missed by the approach. The precision is instead computed as TP/(TP+FP), with FP representing the code-to-comment links wrongly reported by the approach (i.e., statements wrongly identified as linked to the comment).

We also statistically compare the techniques assuming a significance level of 95%. We compare precision and recall using the Wilcoxon signed-rank test [65]. To control for multiple pairwise comparisons (e.g., SALOON's precision compared with that of the three baselines), we adjust p-values with Holm's correction [20]. We estimate the magnitude of the differences using Cliff's Delta (d), a non-parametric effect size measure [15]. We follow well-established guidelines to interpret the effect size: negligible for |d| < 0.10, small for 0.10 ≤ |d| < 0.33, medium for 0.33 ≤ |d| < 0.474, and large for |d| ≥ 0.474 [15]. As for the percentage of correct predictions, we pairwise compare them among the experimented techniques using McNemar's test [41], which is a proportion test suitable to pairwise compare dichotomous results of two different treatments. We complement McNemar's test with the Odds Ratio (OR) effect size. Also in this case, we use the Holm's correction procedure [20] to account for multiple comparisons.
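For clarity, the sketch below shows how the statement-level recall/precision and the Cliff's delta effect size described above can be computed for a single comment and for two score distributions. It is our own illustration of the standard formulas, not the authors' analysis scripts.

```python
def statement_level_prf(predicted: set[int], oracle: set[int]):
    """Recall = TP/(TP+FN), Precision = TP/(TP+FP) over code-to-comment links."""
    tp = len(predicted & oracle)
    fn = len(oracle - predicted)
    fp = len(predicted - oracle)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    correct = predicted == oracle  # "correct prediction": all links match exactly
    return recall, precision, correct

def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x>y minus #pairs with x<y) / (|xs|*|ys|)."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Example: a comment documenting statements {3, 4, 7}, with {3, 4} predicted.
print(statement_level_prf({3, 4}, {3, 4, 7}))  # (0.666..., 1.0, False)
```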
3.3 Results Discussion

As for the comment classification task, SALOON correctly classifies 78.05% (939/1,203) of instances. Out of the 633 code summary comments present in the test set, 536 (84%) have been correctly classified, while 97 have been mistakenly reported as other.
Concerning the 570 "other" comments, SALOON correctly predicted 403 (70%) of them, wrongly reporting 167 instances as code summary. This results in a recall of 0.85 and a precision of 0.76 when identifying a comment as a code summary. This means that, by running our approach on the comments of a previously unseen software system, we can expect to identify 85% of the code summaries present in it accompanied, however, by 25% of false positives (i.e., non code summary comments).

Table 3: T5 vs baselines on the code linking task

Technique                     Correct Predictions   Recall   Precision
Blank line [8]                0.20                  0.87     0.57
Token-based similarity [13]
  λ=0.1                       0.03                  0.62     0.33
  λ=0.2                       0.05                  0.38     0.34
  λ=0.3                       0.05                  0.23     0.26
ML-based [8]                  0.23                  0.49     0.58
SALOON                        0.58                  0.89     0.86

Concerning the code linking task, Table 3 reports the correct predictions (i.e., for a given comment in our test set all linked statements have been correctly identified), recall, and precision achieved by SALOON and the three baselines. Table 4 reports the results of the statistical tests. For the Cliff's Delta d we use N, S, M, and L to indicate its magnitude from Negligible to Large. Note that for the token-based string similarity baseline we report the results achieved with different values of λ (i.e., the minimum similarity threshold to link a code statement to a comment). While we also experimented with values going up to 0.9 [7], the recall values were too close to 0 to consider these variants as reasonable baselines.

Table 4: Code linking task: SALOON vs baselines

Comparison                         Metric                p-value   d           OR
Blank line [8] vs SALOON           Correct Predictions   <0.05     -           19.28
                                   Recall                <0.05     -0.04 (N)   -
                                   Precision             <0.05     -0.48 (L)   -
Token sim. (0.1) [13] vs SALOON    Correct Predictions   <0.05     -           70.80
                                   Recall                <0.05     -0.45 (M)   -
                                   Precision             <0.05     -0.75 (L)   -
Token sim. (0.2) [13] vs SALOON    Correct Predictions   <0.05     -           37.77
                                   Recall                <0.05     -0.66 (L)   -
                                   Precision             <0.05     -0.68 (L)   -
Token sim. (0.3) [13] vs SALOON    Correct Predictions   <0.05     -           38.00
                                   Recall                <0.05     -0.80 (L)   -
                                   Precision             <0.05     -0.73 (L)   -
ML-Based [8] vs SALOON             Correct Predictions   <0.05     -           15.80
                                   Recall                <0.05     -0.49 (L)   -
                                   Precision             <0.05     -0.33 (M)   -

SALOON predicts all statements linked to a given comment in 58% of cases, against the 23% achieved by the best-performing baseline (ML-based). The blank-line technique achieves 20% of correct predictions. The results of the statistical tests confirm the better performance ensured by SALOON in terms of correct predictions: McNemar's test always indicates significant differences in terms of correct predictions, accompanied by ORs indicating that SALOON has between 15.80 and 70.80 higher odds of providing a correct prediction against the baselines.

Recall and precision values confirm the superiority of SALOON for the code linking task. In terms of recall, SALOON is able to correctly link 89% of the statements in our dataset, achieving the best performance among all the experimented techniques. While the blank-line approach achieves a similar recall (87%), it pays a much higher price in terms of precision, with a 43% false positive rate as compared to the 14% of SALOON. Note that a high recall for this heuristic is expected, considering that it links all statements following a comment until a blank line is found. The ML-based technique can only predict half of the correct links (0.49) while achieving a precision score of 0.58. According to our results, the token-based similarity heuristic does not represent a viable solution for the code linking task: The best results are achieved when considering λ=0.1 as a threshold, for which the technique can ensure a recall of 0.62 and a precision of 0.33. Differences in terms of recall and precision are always statistically significant (see Table 4). The effect size is in most cases medium or large, with the only exception of the recall test comparing T5 with the blank-line baseline, for which a negligible effect size is reported.

To summarize, SALOON is able to identify comments representing code summaries with a recall of 0.85 and a precision of 0.76. Also, it achieves state-of-the-art results in linking comments to the documented code, with a recall of 0.89 and a precision of 0.86. In Section 4 we explain how we exploit this model to build a large-scale dataset aimed at training a T5 fine-tuned for the task of code snippet summarization.

4 SNIPPETS SUMMARIZATION USING T5

We discuss how we trained a T5 model for the task of code snippet summarization (Section 4.1), the study we run to evaluate it (Section 4.2), and the achieved results (Section 4.3). We refer to the snippet summarization approach as "STUNT" (SnippeT sUmmarizatioN using T5).

4.1 Approach Description

We rely on the same T5 architecture described in Section 3.1 and we reuse the same pre-trained model we built for the comment classification and code linking tasks. Indeed, as explained in Section 3.1.1, we pre-trained the model on a dataset composed of ∼1.5M Java methods and their inner comments and ∼370k Javadoc comments. Thus, T5 has been pre-trained to acquire knowledge about the two "target languages" relevant for the summarization task as well (i.e., Java code and the technical language used to summarize it). In the following, we detail the fine-tuning dataset and the training procedure.

4.1.1 Fine-tuning Dataset. We used the GHS tool by Dabic et al. [11] to query GitHub for all public non-forked Java projects with a minimum of 50 commits, 5 contributors, and 10 stars. The idea of these filters was to remove toy/personal projects while still obtaining a large set of projects to provide as input to SALOON with the goal of identifying comments representing summaries and linking them to the relevant code. We cloned 10k of the 18.7k projects returned by our query and extracted their methods using srcML [10]. We excluded all methods longer than 512 tokens and removed all duplicates, obtaining a set of methods S. We also removed duplicates between our pre-training dataset and S and between our manually labeled dataset (Section 2.2) and S.
Concerning the removal of duplicates between the pre-training dataset and S, this was needed since S is our starting point to build the fine-tuning dataset for the snippet summarization task, from which we will also extract the test set on which STUNT will be evaluated. Thus, we ensure that STUNT is not evaluated on already seen instances. As for the removal of duplicates between the manually labeled dataset and S, this is due to the fact that SALOON (i.e., our approach for comment classification and linking) has been trained on those instances and we will run it on S to build the fine-tuning dataset for STUNT (i.e., for code summarization). Running SALOON on already seen instances would inflate its performance, and not provide a realistic picture of what can be achieved by training STUNT on a dataset automatically built using SALOON.

From the remaining methods, we extracted all inner comments, filtering out those shorter than 5 words (unlikely to represent a meaningful code summary). As done in previous code summarization works [30], we lowercased and stemmed the comments (using the spaCy NLP library [2]). Then, for each comment Di extracted from a method Mj we created an instance Mj,Di in which Mj's code features special tokens <comment></comment> to surround the comment of interest (Di). This means that if Mj features three inner comments, three Mj,Di instances will be created, each having a different comment (Di) "tagged". This format is the one expected by SALOON to automatically (i) classify Di as code summary or other, and (ii) link Di to the relevant code statements.

The above-described process resulted in 2,210,602 Mj,Di instances that we provided as input to SALOON, which classified 907,660 of them as code summary. Among these, SALOON automatically linked code statements to the code summaries in ∼85% of cases (776,531). These instances are <Mj,DC, Di> pairs, where Mj,DC represents the method Mj with special tokens <start><end> surrounding the statements (DC) documented by Di. If non-contiguous statements are documented, multiple <start><end> pairs are injected in Mj. These pairs are those needed to fine-tune STUNT for the task of snippet summarization: the input provided to the model is Mj,DC (i.e., a snippet to document) and the expected output is the documentation Di.
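A minimal sketch of how a STUNT fine-tuning instance can be assembled from SALOON's output. The tag spelling and spacing follow the description above but are otherwise assumptions; the actual dataset-construction code is in the replication package [7].

```python
def build_stunt_instance(method_lines, documented_lines, summary):
    """Input: the method with <start>...<end> wrapped around the documented
    statements (multiple pairs if they are non-contiguous). Target: the summary."""
    documented = set(documented_lines)
    parts, i = [], 0
    while i < len(method_lines):
        if i in documented:
            block = []
            while i in documented:          # group contiguous documented lines
                block.append(method_lines[i].strip())
                i += 1
            parts.append("<start> " + " ".join(block) + " <end>")
        else:
            parts.append(method_lines[i].strip())
            i += 1
    return " ".join(parts), summary

method = [
    "public void process() {",
    "List<String> msgs = getMessages();",
    "msgs.removeIf(String::isEmpty);",
    "log.info(\"processing\");",
    "listeners.forEach(l -> l.onMessages(msgs));",
    "}",
]
src, tgt = build_stunt_instance(method, {1, 2, 4},
                                "get the messages and notify the listeners")
# src contains two <start>...<end> blocks: lines 1-2 and line 4 are non-contiguous
```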
To avoid favoring the model during testing, we also removed all duplicates at snippet-level granularity. This means that if we have in our dataset two different methods containing the same DC (i.e., the same code snippet to document), we only keep one of them. Also, being an automated approach, SALOON is expected to produce wrong instances (e.g., comments linked to wrong statements) which, in turn, will penalize the performance of STUNT. By manually inspecting a sample of the pairs in our dataset, we noticed that one clear case of wrong instances are those in which the model had very low confidence in identifying the documented statements, thus producing random symbols rather than the expected documented line numbers. We automatically remove those instances, obtaining a set of 554,748 pairs, split into 80% training (443,798), 10% evaluation (55,475), and 10% testing (55,475).

4.1.2 Training Procedure and Hyperparameters Tuning. As explained, we started from the already pre-trained T5 model. We then followed the same hyperparameters tuning discussed in Section 3.1.3, assessing the performance of four different learning rate schedulers on the evaluation set using the BLEU-4 score [44] as performance metric. The BLEU-4 variant computes the BLEU score by considering the overlap of 4-grams between the generated text (i.e., the synthesized snippet summary) and the target text (i.e., the summary written by the original developers). This metric has been used by most of the previous work on code summarization (see e.g., [4, 19, 21–24, 27–29, 31, 57, 59, 61, 64, 69, 71]). Each of the four models has been trained for 100k steps before its evaluation. C-LR (i.e., constant learning rate) provided the best performance. Data about this evaluation are available in our replication package [7]. Once the best T5 variant was identified, we fine-tuned it for up to 500k steps, using an early-stopping strategy to tame over-fitting. To this aim, we monitored the BLEU-4 score achieved on the evaluation set every 5k steps, stopping the training when no improvements were observed after 5 consecutive evaluations.

4.2 Study Design

The goal is to assess the accuracy of STUNT for snippet summarization. The context is represented by (i) 55,475 <Mj,DC, Di> pairs identified by SALOON as described in Section 4.1.1 and belonging to the test set, and (ii) the test set made publicly available by Huang et al. [24] when presenting RL-BlockCom, the state-of-the-art snippet summarization approach discussed in Section 1.

We assess the performance of STUNT against an information retrieval (IR)-based technique (i.e., IR-Jaccard) and RL-BlockCom. To explain the basic idea behind the IR-based baseline, let us remind that both our training and test sets are composed of <Mj,DC, Di> pairs. Given a pair in the test set, the baseline retrieves in the training set the pair having the DC snippet most similar to the one in the test set pair. This means that this pair contains a documented snippet that is very similar to the one in the test set for which we have to generate a code summary. Once the most similar snippet in the training set is identified, the IR-based technique reuses its description to document the instance in the test set. This baseline serves as a representative of works using IR to retrieve similar comments from a given dataset, including e.g., [67].

IR: Jaccard index [17]. IR-Jaccard identifies the most similar snippet using the Jaccard similarity index. The latter considers the overlap between two sets of unique elements, representing in our case the tokens composing the documented code (DC) in the test instance and in each of the training instances. Indeed, we need to compare each instance in the test set to all those in the training set to find the most similar one. The similarity is computed as the percentage of overlapping tokens between the two sets.
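The sketch below shows the retrieval logic of IR-Jaccard as described above (our own illustration, with whitespace tokenization): compute the Jaccard index between the documented snippet under test and every training snippet, then reuse the summary of the most similar one.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def ir_jaccard_summary(test_snippet: str, training_pairs):
    """training_pairs: iterable of (snippet, summary). Returns the summary of the
    training snippet most similar to the test snippet (Jaccard over unique tokens)."""
    test_tokens = set(test_snippet.split())
    best_summary, best_sim = None, -1.0
    for snippet, summary in training_pairs:
        sim = jaccard(test_tokens, set(snippet.split()))
        if sim > best_sim:
            best_sim, best_summary = sim, summary
    return best_summary

train = [
    ("msgs.removeIf(String::isEmpty); listeners.forEach(l -> l.onMessages(msgs));",
     "remove empty messages and notify the listeners"),
    ("int total = a + b; return total;", "compute and return the sum"),
]
print(ir_jaccard_summary("msgs.removeIf(String::isEmpty);", train))
# -> "remove empty messages and notify the listeners"
```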
An additional baseline for STUNT is RL-BlockCom by Huang et al. [24]. Despite the code being available, we did not manage to re-train their approach on our dataset. We contacted the authors asking for help without, however, receiving an answer. Thus, as an alternative form of comparison, we thought about training and testing STUNT on their dataset, which is publicly available, and then comparing the summaries generated by STUNT with those generated by RL-BlockCom. Unfortunately, the authors did not make the summaries generated by their approach publicly available. The only viable form of comparison we found was to (i) re-train STUNT on the training dataset made available by Huang et al. [24] and used to train RL-BlockCom; (ii) use this trained version of STUNT to generate predictions on the same test set on which RL-BlockCom
Towards Summarizing Code Snippets Using Pre-Trained Transformers ICPC 2024, April 2024, Lisbon, Portugal

has been evaluated; (iii) use the evaluation scripts made available by Table 5: BLEU scores: STUNT vs RL-BlockCom [24]
Huang et al. for the computation of the sentence-level BLEU score;
and (iv) compare the achieved results with those reported in their RL-Com STUNT
paper. Indeed, not having access to the summaries generated by
BLEU-1 32.18 34.17
RL-BlockCom does not allow us to double-check the data reported BLEU-2 25.98 31.09
in the original paper nor to compute additional metrics besides BLEU-3 24.36 30.63
those used by the authors (BLEU). Note also that the training/test BLEU-4 24.28 31.22
datasets shared by Huang et al. feature pairs ⟨𝐷𝐶, 𝐷𝑖 ⟩ as compared
to our ⟨𝑀 𝑗,𝐷𝐶 , 𝐷𝑖 ⟩ pairs. This means that STUNT cannot exploit
the contextual information of the method 𝑀 𝑗 when generating the Table 6 compares STUNT against IR-Jaccard on the large-scale
predictions on their dataset. dataset we built. Accordingly to all metrics used in our evaluation,
4.2.1 Data Collection And Analysis. To compare the performance of our model against the two IR-based baselines, we exploit three metrics explained in the following. Out of those, only BLEU has been used in the comparison with RL-BlockCom for the reasons previously explained.

BLEU [44] assesses the quality of the automatically generated summaries by assigning a score between 0 and 1. In our case, 1 indicates that the natural language summary automatically generated is identical to the one originally written by the developer. Since in the test set we built there are no summaries shorter than 4 words, we use the BLEU-4 variant in the comparison with the IR-based baselines. When comparing with RL-BlockCom on their test set, we also compute BLEU-1, BLEU-2 and BLEU-3 as done by Huang et al. [24].

METEOR [6] is a metric based on the harmonic mean of unigram precision and recall (the recall is weighted higher than the precision). Compared to BLEU, METEOR uses stemming and synonym matching to better match the human perception of sentences with similar meanings. Values range from 0 to 1, with 1 being a perfect match.

ROUGE [34] is a set of metrics focusing on automatic summarization tasks. We use the ROUGE-LCS (Longest Common Subsequence) variant, which identifies the longest co-occurring in-sequence n-grams. ROUGE-LCS returns three values: the recall, computed as LCS(X,Y)/length(X); the precision, computed as LCS(X,Y)/length(Y); and the F-measure, computed as the harmonic mean of recall and precision, where X and Y represent two sequences of tokens.
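For reference, the three metrics can be computed as in the following minimal sketch, which assumes NLTK for BLEU-4 and METEOR (the smoothing function is an assumption) and implements ROUGE-LCS directly from the formulas above; it is not the exact evaluation script used in our study.

# Illustrative sketch of the evaluation metrics; reference and candidate are plain strings.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # requires the NLTK "wordnet" corpus

def bleu4(reference, candidate):
    return sentence_bleu([reference.split()], candidate.split(),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

def meteor(reference, candidate):
    return meteor_score([reference.split()], candidate.split())

def lcs_length(x, y):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1] else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_lcs(reference, candidate):
    x, y = reference.split(), candidate.split()
    lcs = lcs_length(x, y)
    recall = lcs / len(x)        # LCS(X,Y) / length(X)
    precision = lcs / len(y)     # LCS(X,Y) / length(Y)
    f_measure = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f_measure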
We also statistically compare the different approaches assuming a significance level of 95%. Also in this case we use the Wilcoxon signed-rank test [65], adjusting p-values to account for multiple comparisons (Holm's correction procedure [20]) and the Cliff's Delta (d) as effect size measure [15]. The statistical comparison was not possible with RL-BlockCom since we only had access to the overall BLEU scores reported in the paper (i.e., the BLEU scores for each generated summary were not available).
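A minimal sketch of this statistical pipeline, assuming SciPy and statsmodels (it is illustrative and not our actual analysis scripts), is the following:

# Illustrative sketch: Wilcoxon signed-rank test per metric, Holm's correction across the
# comparisons, and Cliff's delta as effect size.
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def cliffs_delta(xs, ys):
    # d = P(x > y) - P(x < y), computed over all pairs (quadratic, acceptable for a sketch)
    greater = sum(1 for x in xs for y in ys if x > y)
    smaller = sum(1 for x in xs for y in ys if x < y)
    return (greater - smaller) / (len(xs) * len(ys))

def compare(per_metric_scores):
    # per_metric_scores: {metric_name: (baseline_values, stunt_values)}, paired per test instance
    p_values, deltas = [], []
    for baseline_values, stunt_values in per_metric_scores.values():
        _, p = wilcoxon(baseline_values, stunt_values)
        p_values.append(p)
        deltas.append(cliffs_delta(baseline_values, stunt_values))
    _, adjusted_p, _, _ = multipletests(p_values, alpha=0.05, method="holm")
    return adjusted_p, deltas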
4.3 Results
Table 5 compares STUNT and RL-BlockCom, using the values reported in the paper by Huang et al. [24] as BLEU scores for RL-BlockCom. STUNT achieves better performance for all BLEU scores, outperforming the state-of-the-art approach by a large margin (e.g., +7 points of BLEU-4). A deeper comparison of the two techniques is not possible since the summaries generated by RL-BlockCom are not available.

Table 5: BLEU scores: STUNT vs RL-BlockCom [24]

         RL-BlockCom   STUNT
BLEU-1   32.18         34.17
BLEU-2   25.98         31.09
BLEU-3   24.36         30.63
BLEU-4   24.28         31.22

Table 6 compares STUNT against IR-Jaccard on the large-scale dataset we built. According to all metrics used in our evaluation, the gap in performance between STUNT and the baseline (i.e., IR-Jaccard) is substantial, with gains of at least +11 in terms of BLEU-4, +12 in terms of ROUGE-LCS f-measure, and +16 in terms of METEOR score. As observed by Roy et al. [51], METEOR is “extremely reliable for differences greater than 2 points” in assessing code summarization quality as perceived by humans (i.e., also humans are likely to prefer STUNT’s summaries over those generated by the baselines).

Table 6: Evaluation Metrics: STUNT vs IR-Jaccard

                 IR-Jaccard   STUNT
BLEU-4 [44]      27.43        38.42
ROUGE-LCS [34]
  precision      23.00        34.21
  recall         23.04        37.39
  f-measure      22.33        34.57
METEOR [6]       25.04        41.75

The statistical analyses presented in Table 7 validate STUNT’s superior performance compared to IR-Jaccard. Notably, we observe significant p-values and medium effect sizes for BLEU-4 and ROUGE-LCS (f-measure), while METEOR demonstrates a large effect size.

Table 7: Statistical Tests: STUNT vs IR-Jaccard

Comparison              Metric                  p-value   d
IR (Jaccard) vs STUNT   BLEU-4                  <0.001    -0.451 (M)
                        ROUGE-LCS (f-measure)   <0.001    -0.471 (M)
                        METEOR                  <0.001    -0.474 (L)

While the metrics we computed provide a fair comparison among the experimented techniques, they do not give a clear idea of the quality of the summaries generated by STUNT. To this aim, two of the authors manually inspected 384 randomly selected summaries generated by STUNT for which the generated text was different from the target summary (i.e., the one written by developers). These are cases that in a “binary quantitative evaluation” would be classified as wrong predictions. The authors independently classified each summary as meaningful or not meaningful, based on the ability of the summary to properly describe the documented snippet. In the labeling, the two involved authors achieved a Cohen’s kappa [9] of 0.61, indicating a substantial agreement when measuring inter-rater reliability for categorical items.
Conflicts arose in 71 cases and were solved through open discussion among the authors. We classified 224 summaries as meaningful, with some of them representing even a better summary than the one manually written by the original developers. For example, we found the comment “if we have a frontend then we need to get the action list” to be more meaningful and detailed than the “exit if we do not have a frontend” written by the developer. However, we also want to highlight the ∼41% (160) of automatically generated summaries which were not meaningful and that stress how far we still are from obtaining a code summarizer accurate enough to be deployed to developers (i.e., generating correct summaries in most cases).
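For reference, such an inter-rater agreement value can be computed as in the following toy sketch, which assumes scikit-learn; the label lists are illustrative, not our actual annotations.

# Toy sketch of the inter-rater agreement computation (Cohen's kappa).
from sklearn.metrics import cohen_kappa_score

rater_1 = ["meaningful", "not meaningful", "meaningful", "meaningful"]
rater_2 = ["meaningful", "not meaningful", "not meaningful", "meaningful"]
print(f"Cohen's kappa: {cohen_kappa_score(rater_1, rater_2):.2f}")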
5 THREATS TO VALIDITY
We discuss the threats that could affect the validity of our findings.

Internal Validity. Building our dataset of classified and linked code comments (Section 2) involved a certain degree of subjectivity. To partially address this threat, two evaluators independently assessed each instance and a third one solved conflicts when needed. Still, imprecisions are possible.
We performed a limited hyperparameter tuning of the T5 models, only experimenting with different learning rates. For example, we did not change the number of layers, but relied on the default T5-small architecture by Raffel et al. [47]. Better results could be achieved with additional tuning. Also, relying on pre-trained code models like CodeT5 [63] might produce better results.

Construct Validity. When experimenting with SALOON, we compared its performance with the technique by Chen et al. [8]. However, since their approach is not publicly available, we had to reimplement it following the paper's description. We release our implementation [7]. Still related to the baselines we used, as explained in Section 4.2 we did not manage to compare STUNT (our approach for snippet summarization) with RL-BlockCom [24] on our dataset. At least, we presented a comparison performed on the dataset released by the authors.

External Validity. The manually built dataset represents the obvious bottleneck in terms of generalizability, since it is based on the analysis of “only” 1,508 Java files and also capped our training/evaluation of SALOON. Still, building such a dataset cost over 815 man-hours. Also, we did not compare our technique against general-purpose large language models such as ChatGPT [1], since designing a fair evaluation is challenging due to the unknown training set behind these LLMs. For example, we could have tested the ability of ChatGPT to summarize specific snippets which, however, were part of its training set together with their related comment.

6 RELATED WORK
We discuss techniques for (i) the automated linking of code to comments, and (ii) code summarization.

6.1 Linking Documentation to Code
Some works, while studying code comments from different perspectives, came up with possible heuristics to identify the scope of code comments. Haouari et al. [18] showed that code comments frequently document the following code. While this is confirmed in our dataset, we also observed that it is far from trivial to assess the exact set of (following) lines actually documented by the comment due to the lack of a clear separator isolating the documented from the undocumented code.
Fluri et al. [13], while studying the co-evolution of code and comments, suggested that token-based similarity between the code and the comment can be used to identify documented statements. Such an intuition has also been echoed by McBurney and McMillan [40]. As shown in our study, our DL-based approach substantially outperforms similarity-based heuristics.
Finally, Chen et al. [8] recently proposed a machine-learning based method for the automatic identification of the scope of code comments. Such an approach has been extensively described in Section 3.2 as one of the baselines we compared with. Our approach outperforms this approach as well.
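To make the intuition behind such token-based similarity heuristics concrete, the following is a minimal sketch; the tokenization and threshold are arbitrary assumptions, and it is not the exact implementation of the baselines we experimented with.

# Sketch of a token-based heuristic linking a comment to the statements it documents:
# a following statement is considered documented if the Jaccard similarity between its
# identifier tokens and the comment tokens exceeds a threshold.
import re

def tokens(text):
    return {t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def documented_statements(comment, following_statements, threshold=0.1):
    comment_tokens = tokens(comment)
    return [s for s in following_statements
            if jaccard(comment_tokens, tokens(s)) >= threshold]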
6.2 Code Summarization
Several techniques have been proposed to automatically summarize source code [73]. We focus our discussion on (i) techniques aimed at documenting code snippets (regardless of the underlying techniques used) and (ii) DL-based approaches (regardless of the target code granularity).
Most of the techniques targeting the documentation of code snippets are based on IR. Representative works in this area are CodeInsight [48], ColCom [66, 67], and ADANA [3]. The IR baselines we exploit in Section 4.2 are representative of these works.
Another family of techniques related to code snippet documentation relies on manually defined templates to describe high-level actions performed within functions. Seminal works in this area are those by Sridhara et al. [53] and Wang et al. [60]. These approaches, while valuable, cannot generalize to all combinations of code statements one could expect to find since they are based on predefined templates. For this reason, data-driven techniques exploiting DL have been proposed [24, 72]. When it comes to snippet-level granularity, RL-BlockCom [24] represents the state of the art. As shown, our approach performs substantially better than RL-BlockCom.
Most of the other DL-based techniques proposed in the literature focused on documenting entire functions. Liang et al. [33] presented Code-RNN, a Recursive Neural Network exploiting a GRU cell (Code-GRU) specifically designed for code comment generation. The authors show that their approach can achieve a higher ROUGE score [34] as compared to vanilla DL models not tailored for source code. Hu et al. [21] built a dataset of ⟨method, javadoc⟩ pairs from ∼9k Java projects to train a Deep Neural Network (DNN) aimed at documenting Java methods. The authors used the BLEU-4 score [45] to compare the summaries generated by their approach to those of the neural attention model by Iyer et al. [26], showing the superiority of the proposed technique.
While previous works represented code as a stream of tokens, other authors combined such a representation with one capturing AST information [30, 58]. For example, LeClair et al. [30] showed how exploiting AST-based information allows to improve the performance achieved by both Hu et al. [21] and Iyer et al. [26].
The work by LeClair et al. was later extended and improved by Haque et al. [19], who provide as input to the model additional information related to the “file context” of the method to summarize. They show that such contextual information helps to further boost performance.
Zhang et al. [70] showed that, by combining IR and DL techniques, it is possible to boost the performance of function-level code summarization. Our work focuses on the related but different problem of snippet summarization that, as explained, poses different challenges, especially in the building of the training data.

7 CONCLUSIONS
We targeted the problem of code snippet summarization, presenting (i) a manually labeled dataset of ∼6.6k code comments classified in terms of the information they provide (e.g., code summary) and linked to the code statements they document; (ii) SALOON, a T5 model trained on our manually built dataset to automatically classify and link inner comments in Java code; and (iii) STUNT, a T5 model trained on a large-scale dataset of documented code snippets automatically created by running SALOON on 10k Java projects.
We achieved promising results for both code linking and snippet summarization, pointing however to the need for further research in this field. Our dataset and our models, publicly released [7], represent a step in that direction.

ACKNOWLEDGMENT
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 851720).

REFERENCES
[1] [n.d.]. ChatGPT. https://openai.com/blog/chatgpt.
[2] [n.d.]. Spacy. https://spacy.io.
[3] E. Aghajani, G. Bavota, M. Linares-Vásquez, and M. Lanza. 2021. Automated Documentation of Android Apps. IEEE Transactions on Software Engineering 47, 1 (2021), 204–220.
[4] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653 (2020).
[5] Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In International Conference on Machine Learning (ICML).
[6] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, 65–72. https://aclanthology.org/W05-0909
[7] Double Blind. [n.d.]. https://snippets-summarization.github.io.
[8] Huanchao Chen, Yuan Huang, Zhiyong Liu, Xiangping Chen, Fan Zhou, and Xiaonan Luo. 2019. Automatically detecting the scopes of source code comments. Journal of Systems and Software 153 (2019), 45–63.
[9] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (1960), 37–46.
[10] Michael L Collard, Michael John Decker, and Jonathan I Maletic. 2013. srcML: An infrastructure for the exploration, analysis, and manipulation of source code: A tool demonstration. In 2013 IEEE International Conference on Software Maintenance. IEEE, 516–519.
[11] Ozren Dabic, Emad Aghajani, and Gabriele Bavota. 2021. Sampling Projects in GitHub for MSR Studies. In 18th IEEE/ACM International Conference on Mining Software Repositories, MSR 2021. IEEE, 560–564.
[12] Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia M. de Oliveira. 2005. A Study of the Documentation Essential to Software Maintenance. In International Conference on Design of Communication. 68–75.
[13] Beat Fluri, Michael Wursch, and Harald C. Gall. 2007. Do Code and Comments Co-Evolve? On the Relation between Source Code and Comment Changes. In 14th Working Conference on Reverse Engineering (WCRE 2007). 70–79.
[14] Beat Fluri, Michael Würsch, Emanuel Giger, and Harald C. Gall. 2009. Analyzing the Co-evolution of Comments and Source Code. Software Quality Journal 17, 4 (2009), 367–394.
[15] Robert J Grissom and John J Kim. 2005. Effect sizes for research: A broad practical approach. Lawrence Erlbaum Associates Publishers.
[16] S. Haiduc, J. Aponte, L. Moreno, and A. Marcus. 2010. On the Use of Automated Text Summarization Techniques for Summarizing Source Code. In 2010 17th Working Conference on Reverse Engineering. 35–44.
[17] John M Hancock. 2004. Jaccard distance (Jaccard index, Jaccard similarity coefficient). Dictionary of Bioinformatics and Computational Biology (2004).
[18] Dorsaf Haouari, Houari Sahraoui, and Philippe Langlais. 2011. How good is your comment? A study of comments in Java programs. In 2011 International Symposium on Empirical Software Engineering and Measurement. IEEE, 137–146.
[19] Sakib Haque, Alexander LeClair, Lingfei Wu, and Collin McMillan. 2020. Improved Automatic Summarization of Subroutines via Attention to File Context (MSR '20). 300–310.
[20] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics (1979), 65–70.
[21] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep Code Comment Generation (ICPC '18). Association for Computing Machinery, 200–210.
[22] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2020. Deep code comment generation with hybrid lexical and syntactical information. Springer Empirical Software Engineering 25 (2020), 2179–2217.
[23] Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018. Summarizing source code with transferred API knowledge. (2018).
[24] Yuan Huang, Shaohao Huang, Huanchao Chen, Xiangping Chen, Zibin Zheng, Xiapu Luo, Nan Jia, Xinyu Hu, and Xiaocong Zhou. 2020. Towards automatically generating block comments for code snippets. Information and Software Technology 127 (2020), 106373.
[25] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
[26] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2073–2083.
[27] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2073–2083.
[28] Alexander LeClair, Aakash Bansal, and Collin McMillan. 2021. Ensemble Models for Neural Source Code Summarization of Subroutines. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). 286–297. https://doi.org/10.1109/ICSME52107.2021.00032
[29] Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved code summarization via a graph neural network. In Proceedings of the 28th International Conference on Program Comprehension. 184–195.
[30] Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A Neural Model for Generating Natural Language Summaries of Program Subroutines (ICSE '19). 795–806.
[31] Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A Neural Model for Generating Natural Language Summaries of Program Subroutines. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 795–806. https://doi.org/10.1109/ICSE.2019.00087
[32] Alexander LeClair and Collin McMillan. 2019. Recommendations for datasets for source code summarization. arXiv preprint arXiv:1904.02660 (2019).
[33] Yuding Liang and Kenny Q. Zhu. 2018. Automatic Generation of Text Descriptive Comments for Code Blocks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence (AAAI'18/IAAI'18/EAAI'18). AAAI Press, Article 641, 8 pages.
[34] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
[35] M. Linares-Vásquez, B. Li, C. Vendome, and D. Poshyvanyk. 2015. How do Developers Document Database Usages in Source Code?. In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). 36–41. https://doi.org/10.1109/ASE.2015.67
[36] Zhongxin Liu, Xin Xia, Ahmed E. Hassan, David Lo, Zhenchang Xing, and Xinyu Wang. 2018. Neural-Machine-Translation-Based Commit Message Generation: How Far Are We? Association for Computing Machinery, New York, NY, USA, 373–384. https://doi.org/10.1145/3238147.3238190
[37] Antonio Mastropaolo, Luca Pascarella, and Gabriele Bavota. 2022. Using Deep Learning to Generate Complete Log Statements. In 44th IEEE/ACM International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022. ACM, 2279–2290. https://doi.org/10.1145/3510003.3511561
[38] Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the usage of text-to-text transfer transformer to support code-related tasks. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 336–347.
[39] P. W. McBurney and C. McMillan. 2016. Automatic Source Code Summarization of Context for Java Methods. IEEE Transactions on Software Engineering 42, 2 (2016), 103–119.
[40] Paul W McBurney and Collin McMillan. 2016. An empirical study of the textual similarity between source code and source code summaries. Empirical Software Engineering 21, 1 (2016), 17–42.
[41] Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 2 (1947), 153–157.
[42] Roberto Minelli, Andrea Mocci, and Michele Lanza. 2015. I know what you did last summer: an investigation of how developers spend their time. In Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension, ICPC 2015, Florence/Firenze, Italy, May 16-24, 2015, Andrea De Lucia, Christian Bird, and Rocco Oliveto (Eds.). IEEE Computer Society, 25–35.
[43] Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. 2013. Automatic generation of natural language summaries for Java classes. In 2013 21st International Conference on Program Comprehension (ICPC). IEEE, 23–32.
[44] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[45] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02). 311–318.
[46] Luca Pascarella and Alberto Bacchelli. 2017. Classifying code comments in Java open-source software systems. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, 227–237.
[47] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
[48] M. M. Rahman, C. K. Roy, and I. Keivanloo. 2015. Recommending insightful comments for source code using crowdsourced knowledge. In 2015 IEEE 15th International Working Conference on Source Code Analysis and Manipulation (SCAM). 81–90.
[49] Romain Robbes and Andrea Janes. 2019. Leveraging Small Software Engineering Data Sets with Pre-Trained Neural Networks. In 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). 29–32.
[50] Paige Rodeghero, Siyuan Jiang, Ameer Armaly, and Collin McMillan. 2017. Detecting User Story Information in Developer-Client Conversations to Generate Extractive Summaries (ICSE 2017). 49–59.
[51] Devjeet Roy, Sarah Fakhoury, and Venera Arnaoudova. 2021. Reassessing Automatic Evaluation Metrics for Code Summarization Tasks. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece) (ESEC/FSE 2021). Association for Computing Machinery, New York, NY, USA, 1105–1116. https://doi.org/10.1145/3468264.3468588
[52] D. Spinellis. 2010. Code Documentation. IEEE Software 27, 4 (July 2010), 18–19. https://doi.org/10.1109/MS.2010.95
[53] Giriprasad Sridhara, Lori Pollock, and K Vijay-Shanker. 2011. Automatically detecting and describing high level actions within methods. In 2011 33rd International Conference on Software Engineering (ICSE). IEEE, 101–110.
[54] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. Learning how to mutate source code from bug-fixes. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 301–312.
[55] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. ACM Trans. Softw. Eng. Methodol. 28, 4 (2019), 19:1–19:29.
[56] Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. 2022. Using Pre-Trained Models to Boost Code Review Automation. In 44th IEEE/ACM International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022. ACM, 2291–2302. https://doi.org/10.1145/3510003.3510621
[57] Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 397–407.
[58] Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving Automatic Source Code Summarization via Deep Reinforcement Learning. 397–407.
[59] Wenhua Wang, Yuqun Zhang, Yulei Sui, Yao Wan, Zhou Zhao, Jian Wu, Philip S. Yu, and Guandong Xu. 2022. Reinforcement-Learning-Guided Source Code Summarization Using Hierarchical Attention. IEEE Transactions on Software Engineering 48, 1 (2022), 102–119. https://doi.org/10.1109/TSE.2020.2979701
[60] Xiaoran Wang, Lori Pollock, and K Vijay-Shanker. 2017. Automatically generating natural language descriptions for object-related statement sequences. In 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 205–216.
[61] Yanlin Wang, Ensheng Shi, Lun Du, Xiaodi Yang, Yuxuan Hu, Shi Han, Hongyu Zhang, and Dongmei Zhang. 2021. CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network. arXiv preprint arXiv:2107.01933 (2021).
[62] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 8696–8708. https://doi.org/10.18653/v1/2021.emnlp-main.685
[63] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859 (2021).
[64] Bolin Wei, Yongmin Li, Ge Li, Xin Xia, and Zhi Jin. 2020. Retrieve and refine: exemplar-based neural comment generation. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 349–360.
[65] Frank Wilcoxon. 1945. Individual Comparisons by Ranking Methods. Biometrics Bulletin 1, 6 (1945), 80–83.
[66] Edmund Wong, Taiyue Liu, and Lin Tan. [n.d.]. CloCom: Mining existing source code for automatic comment generation. In Software Analysis, Evolution and Reengineering (SANER), 2015. 380–389.
[67] Edmund Wong, Jinqiu Yang, and Lin Tan. 2013. AutoComment: Mining question and answer sites for automatic comment generation. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 562–567.
[68] X. Xia, L. Bao, D. Lo, Z. Xing, A. E. Hassan, and S. Li. 2018. Measuring Program Comprehension: A Large-Scale Field Study with Professionals. IEEE Transactions on Software Engineering (2018), 951–976.
[69] Wei Ye, Rui Xie, Jinglei Zhang, Tianxiang Hu, Xiaoyin Wang, and Shikun Zhang. 2020. Leveraging code generation to improve code retrieval and summarization via dual learning. In Proceedings of The Web Conference 2020. 2309–2319.
[70] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. 2020. Retrieval-based Neural Source Code Summarization. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). 1385–1397.
[71] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. 2020. Retrieval-based neural source code summarization. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). IEEE, 1385–1397.
[72] Wenhao Zheng, Hong-Yu Zhou, Ming Li, and Jianxin Wu. 2019. CodeAttention: translating source code to comments by exploiting the code constructs. Frontiers Comput. Sci. 13, 3 (2019), 565–578.
[73] Yuxiang Zhu and Minxue Pan. 2019. Automatic code summarization: A systematic literature review. arXiv preprint arXiv:1909.04352 (2019).
