Automatic Documentation Generation via Source Code Summarization of Method Context
4. APPROACH

This section describes the details of our approach, including each step of our natural language generation system. Generally speaking, our approach creates a summary of a given method in three steps: 1) use PageRank to discover the most-important methods in the given method's context, 2) use data from SWUM to extract keywords about the actions performed by those most-important methods, and 3) use a custom NLG system to generate English sentences describing what the given method is used for.

The architecture of our approach is shown in Figure 2. In theory our system could summarize functions in many languages, but in this paper we limit the scope to Java methods. The data we collect about these Java methods is our "communicative goal" (see Section 3.2) and is the basis for the information we convey via NLG.

Figure 2: Overview of our approach.

4.1 Data Collection

The comment generator requires three external tools to produce the necessary input data: SWUM, the call graph generator, and PageRank. SWUM parses the grammatical structure from the function and argument names in a method declaration. This allows us to describe the method based on the contents of its static features. Specifically, SWUM outputs the keywords describing the methods, with each keyword tagged with a part-of-speech (Figure 2, area 3). Next, we produce a call graph of the project for which we are generating comments. Our call graph (generated using java-callgraph, available via https://github.com/gousiosg/java-callgraph, verified 9/12/2013) allows us to see where a method is called so that we can determine the method's context (Figure 2, area 2). Finally, we obtain a PageRank value for every method by executing the PageRank algorithm with the procedure outlined in Section 3.3.

In addition to gleaning this information from the project to produce our comments, we also use the source code of the project itself. For every method call in the call graph, the Data Organizer searches through the code to find the statement that makes that call. The purpose of collecting these statements is to provide a concrete usage example to the programmer. The Data Organizer combines these example statements with the call graph and SWUM keywords to create the Project Metadata (Figure 2, area 4).
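To make the ranking step concrete, the sketch below runs a simplified PageRank over a method-level call graph stored as an adjacency map. The class name, the dangling-node handling, and the fixed damping factor and iteration count are illustrative assumptions on our part; the tool itself follows the procedure outlined in Section 3.3.

    import java.util.*;

    // Illustrative sketch: a simplified PageRank over a method call graph.
    // Nodes are method signatures; edges point from a caller to its callees.
    public class CallGraphRank {

        public static Map<String, Double> pageRank(Map<String, Set<String>> calls,
                                                    double damping, int iterations) {
            // Collect every method that appears as a caller or a callee.
            Set<String> methods = new HashSet<>(calls.keySet());
            for (Set<String> callees : calls.values()) {
                methods.addAll(callees);
            }
            int n = methods.size();
            Map<String, Double> rank = new HashMap<>();
            for (String m : methods) {
                rank.put(m, 1.0 / n);  // start with a uniform distribution
            }
            for (int it = 0; it < iterations; it++) {
                Map<String, Double> next = new HashMap<>();
                for (String m : methods) {
                    next.put(m, (1.0 - damping) / n);  // teleportation term
                }
                for (String caller : methods) {
                    Set<String> callees = calls.getOrDefault(caller, Collections.emptySet());
                    if (callees.isEmpty()) {
                        continue;  // simplification: dangling methods distribute nothing
                    }
                    double share = damping * rank.get(caller) / callees.size();
                    for (String callee : callees) {
                        next.put(callee, next.get(callee) + share);
                    }
                }
                rank = next;
            }
            return rank;  // higher values mark the more "important" methods
        }
    }

A typical invocation would pass a damping factor around 0.85 and a few dozen iterations; the exact parameters used by our tool are described in Section 3.3.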
4.2 Natural Language Generation

This section covers our NLG system. Our system processes the Project Metadata as input (Figure 2, area 5), following each of the NLG steps shown in Figure 1.

Content Determination. We create four different types of "messages" (see Section 3.2) that represent information about a method's context. While all message types may be downloaded from our online appendix, due to space limitations, we discuss only four representative messages here.

First, a Quick Summary Message represents a brief, high-level action summarizing a whole method. For example, "skips whitespace in character streams." We create these messages from the noun/verb labeling of identifier names extracted by SWUM from the method's signature. Our system makes a simplifying assumption that all methods perform some action on some input. If the keyword associated with the input is labeled as a noun by SWUM, and the keyword associated with the method name is a verb, we assume that there is a verb/direct-object relationship between the method name and the input name. This relationship is recorded as a Quick Summary Message.
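A minimal sketch of this check is shown below. The SwumKeyword and QuickSummaryMessage types and their fields are hypothetical stand-ins, not SWUM's actual output format.

    // Hypothetical sketch: build a Quick Summary Message from part-of-speech
    // tags. The SwumKeyword type and its fields are illustrative, not SWUM's API.
    class SwumKeyword {
        String word;          // e.g. "skip" or "whitespace"
        String partOfSpeech;  // e.g. "verb" or "noun"
        SwumKeyword(String word, String partOfSpeech) {
            this.word = word;
            this.partOfSpeech = partOfSpeech;
        }
    }

    class QuickSummaryMessage {
        final String verb;          // the action taken by the method
        final String directObject;  // the input the action is performed on
        QuickSummaryMessage(String verb, String directObject) {
            this.verb = verb;
            this.directObject = directObject;
        }
    }

    class ContentDetermination {
        // Returns a Quick Summary Message only when the simplifying assumption
        // holds: the method-name keyword is a verb and the input keyword is a noun.
        static QuickSummaryMessage quickSummary(SwumKeyword methodName, SwumKeyword input) {
            if ("verb".equals(methodName.partOfSpeech) && "noun".equals(input.partOfSpeech)) {
                return new QuickSummaryMessage(methodName.word, input.word);
            }
            return null;  // no verb/direct-object relationship detected
        }
    }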
Another type of message is the Importance Message. The idea behind an importance message is to give programmers clues about how much time to spend reading a method. The importance message is created by interpreting both the PageRank value of the method and the PageRank values of all other methods. The importance message represents how high this value is above or below average. At the same time, an importance message will trigger our NLG system to include more information in the method's description if the method is ranked highly (see Aggregation below).

A third message type is the Output Usage Message. This message conveys information about the method's output, such as "the character returned by this method is used to skip whitespace in character streams." Our system uses data from quick summary messages, importance messages, and the call graph to create output usage messages. Given a method, our system creates an output usage message by first finding the methods in the call graph which depend on the given method. Then, it picks the two of those methods with the highest PageRank. It uses the quick summary message from those two methods to describe how the output is used.
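The selection of those two calling methods can be sketched as follows; the class and method names are hypothetical, and the real implementation works over the Project Metadata rather than raw maps.

    import java.util.*;

    // Illustrative sketch: pick the two highest-ranked methods that call a
    // given method, as input to an Output Usage Message. Names are hypothetical.
    class OutputUsageSelection {
        static List<String> topTwoCallers(String method,
                                          Map<String, Set<String>> calls,
                                          Map<String, Double> pageRank) {
            List<String> callers = new ArrayList<>();
            for (Map.Entry<String, Set<String>> e : calls.entrySet()) {
                if (e.getValue().contains(method)) {
                    callers.add(e.getKey());  // this caller depends on the given method
                }
            }
            // Sort callers by descending PageRank and keep the top two.
            callers.sort(Comparator.comparingDouble(
                    (String m) -> pageRank.getOrDefault(m, 0.0)).reversed());
            return callers.subList(0, Math.min(2, callers.size()));
        }
    }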
The last message type we will examine in detail is the Use Message. This message serves to illustrate how a programmer can use the method by highlighting a specific example in the code. For example, one message we generated was "the method can be used in an assignment statement; for example: Date releaseDate=getReleaseDate();." Our system uses the call graph to find a line of code that calls the method for which we are generating the message. It then classifies, based on static features of the line of code, whether the calling statement is a conditional, iteration, assignment, or procedural statement. If a source code example cannot be found, the Use Message is omitted.
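A minimal sketch of such a classification is shown below. The real system classifies using static features of the parsed code; this string-based version is only meant to illustrate the four categories.

    // Illustrative sketch: classify the statement that contains a method call.
    // The real system uses static features of the parsed code; this string-based
    // version only hints at the idea.
    class StatementClassifier {
        enum StatementType { CONDITIONAL, ITERATION, ASSIGNMENT, PROCEDURAL }

        static StatementType classify(String statement) {
            String s = statement.trim();
            if (s.startsWith("if") || s.startsWith("switch")) {
                return StatementType.CONDITIONAL;
            }
            if (s.startsWith("while") || s.startsWith("for") || s.startsWith("do")) {
                return StatementType.ITERATION;
            }
            if (s.contains("=") && !s.contains("==")) {
                return StatementType.ASSIGNMENT;   // e.g. Date releaseDate = getReleaseDate();
            }
            return StatementType.PROCEDURAL;       // a plain call statement
        }
    }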
Document Structuring. After generating the initial messages in the content determination phase, we organize all the messages into a single document plan. We use a templated document plan where messages occur in a pre-defined order: Quick Summary Messages, Return Messages, Output Used Messages, Called Messages, Importance Messages, and then Use Messages. Note that this order may change during the Aggregation phase below.
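The sketch below illustrates the idea of a templated plan as a fixed ordering over hypothetical Message objects; the Message type and its fields are our own simplification of the data our tool carries between phases.

    import java.util.*;

    // Illustrative sketch of the templated document plan: messages are emitted
    // in a fixed order of kinds. The Message type here is hypothetical.
    enum MessageKind { QUICK_SUMMARY, RETURN, OUTPUT_USED, CALLED, IMPORTANCE, USE }

    class Message {
        final MessageKind kind;
        final String about;    // the method this message describes
        final String phrase;   // the phrase produced later, during lexicalization
        Message(MessageKind kind, String about, String phrase) {
            this.kind = kind;
            this.about = about;
            this.phrase = phrase;
        }
    }

    class DocumentPlan {
        // Order messages by the pre-defined template; aggregation may still
        // regroup them afterwards, as noted in the text.
        static void order(List<Message> messages) {
            messages.sort(Comparator.comparing((Message m) -> m.kind));
        }
    }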
Lexicalization. Each type of message needs a different type of phrase to describe it. This section will describe how we decide on the words to be used in each of those phrases, for the four message types described under Content Determination. Note that the phrases we generate are not complete sentences; they will be grouped with other phrases during Aggregation and formed into sentences during realization.

The Quick Summary Message records a verb/direct-object relationship between two words extracted by SWUM. The conversion to a sentence is simple in this case: the verb becomes the verb in the sentence, and likewise for the direct object. The subject is assumed to be "the method", but is left out for brevity. To give the reader further information about the method's purpose, we add the input parameter type as an indirect object using the preposition "in".

We create a phrase for an Output Usage Message by setting the object as the return type of the method, and the verb as "is". The subject is the phrase generated from the Quick Summary Message. We set the voice of the phrase to be passive. We decided to use passive voice to emphasize how the return data is used, rather than the contents of the Quick Summary Message. An example of the phrase we output is under the Content Determination section.

The Use Message is created with the subject "this method", the verb phrase "can be used", and the prepositional phrase "as a statement type;". The statement type is pulled from the data structures populated in our content determination step. Additionally, we append a second dependent clause "for example: code".
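The sketch below shows how such phrases could be assembled with the SimpleNLG 4 API [13]; whether our tool builds its phrases with exactly these calls is an assumption on our part, and the realized strings are only indicative.

    import simplenlg.features.Feature;
    import simplenlg.framework.NLGFactory;
    import simplenlg.lexicon.Lexicon;
    import simplenlg.phrasespec.PPPhraseSpec;
    import simplenlg.phrasespec.SPhraseSpec;
    import simplenlg.realiser.english.Realiser;

    // Illustrative sketch of lexicalization with SimpleNLG; not the tool's exact code.
    public class LexicalizationSketch {
        public static void main(String[] args) {
            Lexicon lexicon = Lexicon.getDefaultLexicon();
            NLGFactory factory = new NLGFactory(lexicon);
            Realiser realiser = new Realiser(lexicon);

            // Quick Summary phrase: verb and direct object from SWUM, plus the
            // input parameter type as an indirect object introduced by "in".
            SPhraseSpec quickSummary = factory.createClause();
            quickSummary.setSubject("the method");   // suppressed later for brevity
            quickSummary.setVerb("skip");
            quickSummary.setObject("whitespace");
            PPPhraseSpec in = factory.createPrepositionPhrase("in", "character streams");
            quickSummary.addComplement(in);
            System.out.println(realiser.realiseSentence(quickSummary));
            // should print something like "The method skips whitespace in character streams."

            // Output Usage phrase: passive voice emphasizes how the output is used.
            SPhraseSpec usage = factory.createClause();
            usage.setSubject("methods that scan the data");
            usage.setVerb("use");
            usage.setObject("that Object");
            usage.setFeature(Feature.PASSIVE, true);
            System.out.println(realiser.realiseSentence(usage));
            // should print something like "That Object is used by methods that scan the data."
        }
    }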
Reference Generation and Aggregation. During Aggregation, we create more-complex and readable phrases from the phrases generated during Lexicalization. Our system works by looking for patterns of message types, and then grouping the phrases of those messages into a sentence. For example, if two Output Usage Messages are together, and both refer to the same method, then the phrases of those two messages are conjoined with an "and" and the subject and verb of the second phrase are hidden. In another case, if a Quick Summary Message follows a Quick Summary Message for a different method, then it implies that the messages are related, and we connect them using the preposition "for". The result is a phrase such as "skips whitespace in character streams for a method that processes xml". Notice that Reference Generation occurs alongside Aggregation. Rather than hiding the subject in the phrase "processes xml", we make it explicit as "method" and non-specific using the article "a" rather than "the." Due to space limitations, we direct readers to our online appendix for a complete listing of the Aggregation techniques we follow.
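The sketch below illustrates the first pattern at the message level, reusing the hypothetical Message and MessageKind types from the document-plan sketch above. The plain string concatenation stands in for the phrase-level merging our system performs; the real rules also hide the repeated subject and verb of the second phrase.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of one aggregation pattern: two adjacent Output Usage
    // Messages about the same method are conjoined with "and".
    class AggregationRuleSketch {
        static List<String> aggregate(List<Message> plan) {
            List<String> phrases = new ArrayList<>();
            for (int i = 0; i < plan.size(); i++) {
                Message current = plan.get(i);
                boolean pairable = i + 1 < plan.size()
                        && current.kind == MessageKind.OUTPUT_USED
                        && plan.get(i + 1).kind == MessageKind.OUTPUT_USED
                        && current.about.equals(plan.get(i + 1).about);
                if (pairable) {
                    phrases.add(current.phrase + " and " + plan.get(i + 1).phrase);
                    i++;  // the second message has been folded into the first
                } else {
                    phrases.add(current.phrase);
                }
            }
            return phrases;
        }
    }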
Surface Realization. We use an external library, simplenlg [13], to realize complete sentences from the phrases formed during Aggregation. In the above steps, we set all words and parts-of-speech and provided the structure of the sentences. The external library follows English grammar rules to conjugate verbs, and ensures that the word order, plurals, and articles are correct. The output from this step is the English summary of the method (Figure 2, area 6).

5. EXAMPLE

In this section, we explore an example of how we form a summary for a specific method. We will elaborate on how we use SWUM, the call graph, PageRank, and source code to form our messages.

Consider getResult() from StdXMLBuilder.java in NanoXML. The method's signature, public Object getResult(), is parsed by SWUM, which will tell us the verb is "get" and the object is "result." Additionally, it will note the return type as "object." This will be used to generate the Quick Summary Message "This method gets the result and returns an Object." Then, using the call graph, we determine that the top two methods (as scored by PageRank) that call getResult() are scanData() and parse(). Initially, in the document planning phase, we generate two separate messages, one using the SWUM information for each function. However, these are combined in the aggregation step with the conjunction "and", and eventually produce the Output Usage Message "That Object is used by methods that scans the data and that parses the std XML parser."

The last message we generate is the Use Message. We search through the most important calling method, which in this case is scanData(). We take a line of code that calls getResult(), and determine based on its content whether it is a conditional, iteration, assignment, or procedural statement. Using this information, we generate the Use Message "The method can be used in an iteration statement; for example: while ((!this.reader.atEOF()) && (this.builder.getResult() == null)) { ". These messages are then appended together to make the final summary.

6. EVALUATION

Our evaluation compares our approach to the state-of-the-art approach described by Sridhara et al. [46]. The objective of our evaluation is three-fold: 1) to assess the degree to which our summaries meet the quality of summaries generated by a state-of-the-art solution, 2) to assess whether the summaries provide useful contextual information about the Java methods, and 3) to determine whether the generated summaries can be used to improve, rather than replace, existing documentation.

Assessing Overall Quality. One goal of our evaluation is to quantify any difference in quality between our approach presented in this paper and the existing state-of-the-art approach, and to determine in what areas the quality of the summaries can be most improved. To assess quality, we ask the following three Research Questions (RQs):

RQ1 To what degree do the summaries from our approach and the state-of-the-art approach differ in overall accuracy?

RQ2 To what degree do the summaries from our approach and the state-of-the-art approach differ in terms of missing important information?

RQ3 To what degree do the summaries from our approach and the state-of-the-art approach differ in terms of including unnecessary information?

These Research Questions are derived from two earlier evaluations of source code summarization [34, 46], where the "quality" of the generated comments was assessed in terms of accuracy, content adequacy, and conciseness. Content adequacy referred to whether there was missing information, while conciseness referred to limiting unnecessary information in the summary. This strategy for evaluating generated comments is supported by a recent study of source code comments [49] in which quality was modeled as a combination of factors correlating to accuracy, adequacy, and conciseness.
Assessing Contextual Information. Contextual information about a method is meant to help programmers understand the behavior of that method. But, rather than describe that behavior directly from the internals of the method itself, context explains how that method interacts with other methods in a program. By reading the context, programmers then can understand what the method does, why it exists, and how to use it (see Section 2). Therefore, we study these three Research Questions:

RQ4 Do the summaries help programmers understand what the methods do internally?

RQ5 Do the summaries help programmers understand why the methods exist?

RQ6 Do the summaries help programmers understand how to use the methods?

The rationale behind RQ4 is that a summary should provide programmers with enough details to understand the most-important internals of the method—for example, the type of algorithm the method implements—without forcing them to read the method's source code. Our summaries aim to include this information solely from the context. If our summaries help programmers understand the methods' key internals, it means that this information came from the context. For RQ5, a summary should help programmers understand why the method is important to the program as a whole. For example, the programmers should be able to know, from reading the summary, what the consequences might be of altering or removing the method. Likewise, for RQ6, the summary should explain the key details about how a programmer may use the method in his or her own code.

Orthogonality. While the ultimate goal of this research is to generate documentation purely from data in the source code, we also aim to improve existing documentation by adding contextual information. In particular, we ask:

RQ7 Do the summaries generated by our solution contain orthogonal information to the information already in the summaries from the state-of-the-art solution?

The idea behind this RQ is that to improve existing summaries, the generated summaries should contribute new information, not merely repeat what is already in the summaries. We generate summaries by analyzing the context of methods, so it is plausible that we add information from this context, which does not exist in the summaries from the state-of-the-art solution.

6.1 Cross-Validation Study Methodology

To answer our Research Questions, we performed a cross-validation study in which human experts (e.g., Java programmers) read the source code of different Java methods, as well as summaries of those methods, for three different rounds. For each method and summary, the experts answered eight questions that covered various details about the summary. Table 2 lists these questions. The first six correspond to each of the Research Questions above, and were multiple choice. The final two were open-ended questions; we study the responses to these two questions in a qualitative evaluation in Section 8.

In the cross-validation study design, we rotated the summaries and Java methods that the human evaluators read. The purpose of this rotation was to ensure that all evaluators would read summaries from each different approach for several different Java programs, and to mitigate any bias from the order in which the approaches and methods were presented [32]. Table 1 shows our study design in detail. Upon starting the study, each participant was randomly assigned to one of three groups. Each of those groups was then assigned to see one of three types of summary: summaries from our approach, summaries from the state-of-the-art approach, or both summaries at the same time.

Table 1: The cross-validation design of our user study. Different participants read different summaries for different programs.

Round  Group  Summary   Program 1  Program 2
1      A      Our       NanoXML    Jajuk
1      B      S.O.T.A.  Siena      JEdit
1      C      Combined  JTopas     JHotdraw
2      A      Combined  Siena      Jajuk
2      B      Our       JTopas     JEdit
2      C      S.O.T.A.  NanoXML    JHotdraw
3      A      S.O.T.A.  JTopas     Jajuk
3      B      Combined  NanoXML    JEdit
3      C      Our       Siena      JHotdraw

Table 2: The questions we ask during the user study. The first six are answerable as "Strongly Agree", "Agree", "Disagree", and "Strongly Disagree." The last two are open-ended.

Q1  Independent of other factors, I feel that the summary is accurate.
Q2  The summary is missing important information, and that can hinder the understanding of the method.
Q3  The summary contains a lot of unnecessary information.
Q4  The summary contains information that helps me understand what the method does (e.g., the internals of the method).
Q5  The summary contains information that helps me understand why the method exists in the project (e.g., the consequences of altering or removing the method).
Q6  The summary contains information that helps me understand how to use the method.
Q7  In a sentence or two, please summarize the method in your own words.
Q8  Do you have any general comments about the given summary?
6.2 Subject Java Programs

The summaries in the study corresponded to Java methods from six different subject Java programs, listed in Table 3. We selected these programs for a range of size (5 to 117 KLOC, 318 to 7161 methods) and domain (including text editing, multimedia, and XML parsing, among others). During the study, participants were assigned to see methods from four of these applications. During each of three different rounds, we rotated one of the programs that the groups saw, but retained the fourth program. The reason is so that the group would evaluate different types of summaries for different programs, but also evaluate different types of summaries from a single application. From each application, we pre-selected (randomly) a pool of 20 methods. At the start of each round, we randomly selected four methods from the pool for the rotated application, and four from the fixed application. Over three rounds, participants read a total of 24 methods. Because the methods were selected randomly from a pool, the participants did not all see the same set of 24 methods. The programmers could always read and navigate the source code for these applications, though we removed all comments from this code to avoid introducing a bias from these comments.

Table 3: The six Java programs used in our evaluation. KLOC reported with all comments removed. All projects are open-source.

Program   Methods  KLOC  Java Files
NanoXML   318      5.0   28
Siena     695      44    211
JTopas    613      9.3   64
Jajuk     5921     70    544
JEdit     7161     117   555
JHotdraw  5263     31    466

6.3 Participants

We had 12 participants in our study. Nine were graduate students from the Computer Science and Engineering Department at the University of Notre Dame. The remaining three were professional programmers from two different organizations, not listed due to our privacy policy.

6.4 Metrics and Statistical Tests

Each of the multiple choice questions could be answered as "Strongly Agree", "Agree", "Disagree", or "Strongly Disagree." We assigned values to these answers: 4 for "Strongly Agree", 3 for "Agree", 2 for "Disagree", and 1 for "Strongly Disagree." For questions 1, 4, 5, and 6, higher values indicate stronger performance. For questions 2 and 3, lower values are preferred. We aggregated the responses for each question by approach: for example, all responses to question 1 for the summaries from our approach, and all responses to question 1 for the summaries from the state-of-the-art approach.

To determine the statistical significance of the differences in these groups, we used the two-tailed Mann-Whitney U test [44]. The Mann-Whitney test is non-parametric, and it does not assume that the data are normally distributed. However, the results of these tests may not be accurate if the number of participants in the study is too small. Therefore, to confirm statistical significance, we use the procedure outlined by Morse [35] to determine the minimum population size for a tolerated p value of 0.05. We calculated these minimum sizes using observed, not expected, values of U.
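A simplified sketch of this test on the 4-to-1 ratings is shown below; it uses average ranks for ties but omits the tie correction in the variance, and it does not include the Morse [35] minimum-sample-size check, so it is only an approximation of the full analysis.

    import java.util.*;

    // Illustrative sketch of the Mann-Whitney U computation applied to two
    // samples of ratings. Ties receive average ranks; the variance below omits
    // the tie correction, so this simplifies the full test [44].
    public class MannWhitneySketch {

        static double[] ranks(double[] combined) {
            Integer[] idx = new Integer[combined.length];
            for (int i = 0; i < idx.length; i++) idx[i] = i;
            Arrays.sort(idx, Comparator.comparingDouble(k -> combined[k]));
            double[] r = new double[combined.length];
            int i = 0;
            while (i < idx.length) {
                int j = i;
                while (j + 1 < idx.length && combined[idx[j + 1]] == combined[idx[i]]) j++;
                double avg = (i + j) / 2.0 + 1.0;   // average rank for the tie group
                for (int k = i; k <= j; k++) r[idx[k]] = avg;
                i = j + 1;
            }
            return r;
        }

        // Returns { U, z } for samples a and b (e.g., ratings for two summary types).
        static double[] mannWhitney(double[] a, double[] b) {
            double[] combined = new double[a.length + b.length];
            System.arraycopy(a, 0, combined, 0, a.length);
            System.arraycopy(b, 0, combined, a.length, b.length);
            double[] r = ranks(combined);
            double rankSumA = 0;
            for (int i = 0; i < a.length; i++) rankSumA += r[i];
            double u = rankSumA - a.length * (a.length + 1) / 2.0;
            double mean = a.length * (double) b.length / 2.0;
            double variance = a.length * (double) b.length * (a.length + b.length + 1) / 12.0;
            double z = (u - mean) / Math.sqrt(variance);  // normal approximation
            return new double[] { u, z };
        }
    }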
6.5 Threats to Validity

As with any study, our evaluation carries threats to validity. We identified two main sources of these threats. First, our evaluation was conducted by human experts, who may be influenced by factors such as stress, fatigue, or variations in programming experience. We attempted to mitigate these threats through our cross-validation study design, which altered the order in which the participants viewed the Java methods and summaries. We also recruited our participants from a diverse body of professionals and students, and confirmed our results with accepted statistical testing procedures. Still, we cannot guarantee that a different group of participants would not produce a different result.

Another source for a threat to validity is the set of Java programs we selected. We chose a variety of applications of different sizes and from different domains. In total, we generated summaries for over 19,000 Java methods from six projects, and randomly selected 20 of these methods from each project to be included in the study (four to twelve of which were ultimately shown to each participant). Even with this large pool of methods, it is still possible that our results would change with a different set of projects. To help mitigate this threat, we have released our tool implementation and all evaluation data in an online appendix, so that other researchers may reproduce our work in independent studies.

7. EMPIRICAL RESULTS

This section reports the results of our evaluation. First, we present our statistical process and evidence. Then, we explain our interpretation of this evidence and answer our research questions.

7.1 Statistical Analysis

The main independent variable was the type of summary rated by the participants: summaries generated by our solution, summaries from the state-of-the-art solution, or both presented together. The dependent variables were the ratings for each question: 4 for "Strongly Agree" to 1 for "Strongly Disagree".

For each question, we compare the mean of the participants' ratings for our summaries to the summaries from the state-of-the-art approach. We also compare the ratings given when both summaries were shown, versus only the state-of-the-art summaries. We compared these values using the Mann-Whitney test (see Section 6.4). Specifically, we posed 12 hypotheses of the form:

Hn The difference in the reported ratings of the responses for Qm is not statistically-significant.

where n ranges from 1 to 12, and m ranges from 1 to 6, depending on which question is being tested. For example, in H11, we compare the answers to Q5 for the state-of-the-art summaries to the answers to Q5 for the combined summaries.

Table 4 shows the rejected hypotheses (i.e., the means with a statistically-significant difference). We made a decision to reject a hypothesis only when three criteria were met. First, |Z| must be greater than Zcrit. Second, p must be less than the tolerated error 0.05. Finally, the calculated minimum number of participants (Nmin) must be less than or equal to the number of participants in our study (13). Note that due to space limitations we do not include values for the four hypotheses for which we do not have evidence to reject. For reproducibility purposes, these results are available at our online appendix.

7.2 Interpretation

Figure 3 showcases the key evidence we study in this evaluation. We use this evidence to answer our Research Questions along the three areas highlighted in Section 6.
Table 4: Statistical summary of the results for the participants' ratings for each question. "Samp." is the number of responses for that question for a given summary type, for all rounds. Mann-Whitney test values are U, Uexpt, and Uvari. Decision criteria are Z, Zcrit, and p. Nmin is the calculated minimum number of participants needed for statistical significance.

H    Q   Summary    Samp.  x̃  µ      Vari.   U     Uexpt  Uvari  Z      Zcrit  p      Nmin  Decision
H1   Q1  Our        65     3  3.015  0.863   1223  1917   34483  3.74   1.96   <1e-3  2     Reject
         S.O.T.A.   59     3  2.390  0.863
H2   Q2  Our        65     3  2.492  0.973   2272  1885   36133  2.03   1.96   0.042  2     Reject
         S.O.T.A.   58     3  2.862  1.139
H3   Q3  Our        65     2  1.815  0.497   2011  1917   34204  0.503  1.96   0.615  15    Not Reject
         S.O.T.A.   59     2  1.983  0.982
H4   Q4  Our        65     3  2.877  0.641   1410  1917   34475  2.736  1.96   0.006  4     Reject
         S.O.T.A.   59     3  2.407  0.832
H5   Q5  Our        65     3  2.585  0.809   1251  1885   35789  3.351  1.96   0.001  2     Reject
         S.O.T.A.   58     3  1.983  0.930
H6   Q6  Our        65     3  2.769  0.649   799   1885   35447  5.771  1.96   <1e-3  3     Reject
         S.O.T.A.   58     3  1.776  0.773
H7   Q1  Combined   59     3  2.847  0.580   1266  1741   29026  2.788  1.96   0.005  3     Reject
         S.O.T.A.   59     3  2.390  0.863
H8   Q2  Combined   59     2  2.322  0.843   2219  1711   31205  2.876  1.96   0.004  3     Reject
         S.O.T.A.   58     3  2.862  1.139
H9   Q3  Combined   59     2  2.542  1.149   1227  1741   31705  2.884  1.96   0.004  5     Reject
         S.O.T.A.   59     2  1.983  1.149
H10  Q4  Combined   58     3  2.879  0.564   1241  1711   27964  2.811  1.96   0.005  5     Reject
         S.O.T.A.   59     3  2.407  0.832
H11  Q5  Combined   59     3  2.508  0.634   1176  1711   30423  3.064  1.96   0.002  4     Reject
         S.O.T.A.   58     2  1.983  0.930
H12  Q6  Combined   59     3  2.746  0.503   705   1711   29972  5.814  1.96   <1e-3  2     Reject
         S.O.T.A.   58     2  1.776  0.773

(a) Our vs. S.O.T.A. Summaries    (b) Combined vs. S.O.T.A. Summaries

Figure 3: Performance comparison of the summaries. The chart shows the difference in the means of the responses to each question. For example, in (a) the mean of Q5 for our approach is 0.602 higher than for the state-of-the-art summaries. The sign is reversed for Q2 and Q3 because lower scores, not higher scores, are better values for those questions. Solid bars indicate differences which are statistically-significant. In general, our summaries were more accurate and provided more-thorough contextual information.
Overall Quality. The summaries from our approach are superior in overall quality to the summaries from the state-of-the-art approach. Figure 3(a) shows the difference in the means of the responses for the survey questions. Questions Q1 through Q3 refer to aspects of the summaries related to overall quality, in particular to our Research Questions RQ1 to RQ3. In short, participants rated our summaries as more accurate and as missing less required information by a statistically-significant margin. While these results are encouraging progress, they nevertheless still point to a need to improve. In Section 8, we explore what information the participants felt was unnecessary in our summaries.

Contextual Information. The summaries from our approach included more contextual information than the state-of-the-art summaries. The responses for questions Q4, Q5, and Q6 are higher for our summaries by a statistically-significant margin. These results mean that, in comparison to the state-of-the-art summaries, our summaries helped the programmers understand why the methods exist and how to use those methods. Therefore, we answer RQ4, RQ5, and RQ6 with a positive result. The answers to these research questions point to an important niche filled by our approach: the addition of contextual information to software documentation.
Orthogonality. We found substantial evidence showing that our summaries improve the state-of-the-art summaries. When participants read both types of summary for a given method, the responses for Q4, Q5, and Q6 improved by a significant margin, pointing to an increase in useful contextual information in the documentation. Overall quality did not decrease by a significant margin compared to when only our solutions were given, except in terms of unnecessary information added. Consider Figure 3(b): accuracy and missing information scores showed similar improvement. While the combined summaries did show a marked increase in unnecessary information, we still find evidence to positively answer RQ7: the information added to state-of-the-art summaries by our approach is orthogonal. This answer suggests that our approach can be used, after future work to reduce unnecessary information, to improve existing documentation.

8. QUALITATIVE RESULTS

Participants in the evaluation study had the opportunity to write an opinion about each summary (see Q8 in Table 2). In this section, we explore these opinions for feedback on our approach and directions for future work.

One of the results in our study was the significantly worse performance of Q3 in the combined comments, suggesting an increase in the amount of unnecessary information. Several user comments from our survey note concerns about repetitious information, as well as difficulties in processing the longer comments that result from the combination.

• "The description is too verbose and contains too many details."

• "The summary contains too much information and confuses the purpose of the method..."

• "The summary seems accurate but too verbose."

• "Too much information, I cannot understand the comment."

Another result is the increase in the scores for Q5 and Q6, which deal with how a programmer can use the method within the system. This increase appears to be due to the Use Message. Several users noted a lack of any form of usage message in the state-of-the-art approach. A selection of these comments follows.

• "Nice and concise, but lacking information on uses..."

• "The summary is clear. An example is expected."

• "The summary...does not tell me where the method is called or how it is used."

Additionally, in a method summary from our approach that did not generate a Use Message, a participant noted "I feel that an example should be provided." However, one participant in our study had a largely negative opinion of the Use Message. This participant repeatedly referred to the "last sentence" (the Use Message) as "unnecessary", even stating "Assume every one of these boxes comments about removing the last line of the provided comment."

Participants often felt the state-of-the-art approach lacked critical information about the function. Comments indicating a lack of information appeared consistently from many participants. The following comments (each from a different participant) support this criticism:

• "A bit sparse and missing a lot of information."

• "Comment details the inner workings but provides no big picture summary."

• "Only provides a detail for one of the possible branches."

• "It seems the summary is generated only based on the last line of the method."

These comments occur more frequently with the state-of-the-art compared to our approach. A possible reason for this is that our approach focuses much more on method interactions (e.g., method calls), and avoids the internal details of the function. By contrast, the state-of-the-art approach focuses on a method's internal execution, selecting a small subset of statements to use in the summary. Participants felt this selection often leaves out error checking and alternate branches, focusing too narrowly on particular internal operations while ignoring others.

Several of our generated summaries and the state-of-the-art generated summaries had grammar issues that distracted users. Additionally, the state-of-the-art approach often selected lines of source code, but did not generate English summaries for those lines. Several users commented on these issues, noting that it made the summaries either impossible or difficult to understand. Our aim is to correct these issues going forward with refinement of our NLG tool.

Another common theme of participant comments in both our approach and the state-of-the-art centered on function input parameters. Many participants felt an explanation of input parameters was lacking in both approaches, as well as the combination approach. A selection of these comments follows. These comments were selected from our approach, the state-of-the-art, and the combined approach respectively:

• "The input parameters publicID and systemID are not defined – what are they exactly?"

• "The summary could mention the input required is the path for the URL"

• "... It would be better if the summary described the types of the inputs..."

9. RELATED WORK

The related work closest to our approach is detailed in a recent thesis by Sridhara [45]. In Section 3, we summarized certain elements of this work that inspired our approach. Two aspects we did not discuss are as follows. First, one approach creates summaries of the "high level actions" in a method [47]. A high level action is defined as a behavior at a level of abstraction higher than the method. The approach works by identifying which statements in a method implement that behavior, and summarizing only those statements. A second approach summarizes the role of the parameters to a method. This approach creates a description of key statements related to the parameter inside the method. Our approach is different from both of these approaches in that we create summaries from the context of the method – that is, where the method is invoked. We help programmers understand the role the method plays in the software.
There are a number of other approaches that create natural language summaries of different software artifacts and behaviors. Moreno et al. describe a summarization technique for Java classes that match one of 13 "stereotypes." [34] The technique selects statements from the class based on this stereotype, and then uses the approach by Sridhara [46] to summarize those statements. Work by Buse et al. focuses on Java exceptions [3]. Their technique is capable of identifying the conditions under which an exception will be thrown, and producing brief descriptions of those conditions. Recent work by Zhang et al. performs a similar function by explaining failed tests [56]. That approach modifies a failed test by swapping different expressions into the test to find the failure conditions. Summary comments of those conditions are added to the test. Another area of focus has been software changes. One approach is to improve change log messages [4]. Alternatively, work by Kim et al. infers change rules, as opposed to individual changes, that explain the software's evolution [23]. The technique can summarize the high-level differences between two versions of a program. Another approach, developed by Panichella et al., uses external communications between developers, such as bug reports and e-mails, and structures them to produce source code documentation [37].

The key difference between our approach and these existing approaches is that we summarize the context of the source code, such as how the code is called or how its output is used. Structural information has been summarized before, in particular by Murphy [36], in order to help programmers understand and evolve software. Murphy's approach, the software reflexion model, notes the connections between low-level software artifacts in order to point out connections between higher-level artifacts. There are techniques which give programmers some contextual information by listing the important keywords from code. For example, Haiduc et al. use a Vector Space Model to rank keywords from the source code, and present those keywords to programmers [15]. The approach is based on the idea that programmers read source code cursorily by reading these keywords, and use that information to deduce the context behind the code. Follow-up studies have supported the conclusions that keyword-list summarization is useful to programmers [1], and that VSM is an effective strategy for extracting these keywords [6, 9].

Tools such as Jadeite [51], Apatite [10], and Mica [50] are related to our approach in that they add API usage information to documentation of those APIs. These tools visualize the usage information as part of the interface for exploring or locating the documentation. We take a different strategy by summarizing the information as natural language text. What is similar is that this work demonstrates a need for documentation to include the usage data, as confirmed by studies of programmers during software maintenance [22, 28, 53].

10. CONCLUSION

We have presented a novel approach for automatically generating summaries of Java methods. Our approach differs from previous approaches in that we summarize the context surrounding a method, rather than details from the internals of the method. We use PageRank to locate the most-important methods in that context, and SWUM to gather relevant keywords describing the behavior of those methods. Then, we use a custom NLG system to create natural language text about this context. The output is a set of English sentences describing why the method exists in the program, and how to use the method. In a cross-validation study, we compared the summaries from our approach to summaries written by a state-of-the-art solution. We found that our summaries were superior in quality and that our generated summaries fill a key niche by providing contextual information. That context is missing from the state-of-the-art summaries. Moreover, we found that by combining our summaries with the state-of-the-art summaries, we can improve existing software documentation. Finally, the source code for our tool's implementation and evaluation data are publicly available for future researchers.

11. ACKNOWLEDGMENTS

The authors would like to thank Dr. Emily Hill for providing key assistance with the SWUM tool. We also thank and acknowledge the Software Analysis and Compilation Lab at the University of Delaware for important help with the state-of-the-art summarization tool. Finally, we thank the 12 participants who spent time and effort completing our evaluation.

12. REFERENCES

[1] J. Aponte and A. Marcus. Improving traceability link recovery methods through software artifact summarization. In Proceedings of the 6th International Workshop on Traceability in Emerging Forms of Software Engineering, TEFSE '11, pages 46–49, New York, NY, USA, 2011. ACM.

[2] H. Burden and R. Heldal. Natural language generation from class diagrams. In Proceedings of the 8th International Workshop on Model-Driven Engineering, Verification and Validation, MoDeVVa, pages 8:1–8:8, New York, NY, USA, 2011. ACM.

[3] R. P. Buse and W. R. Weimer. Automatic documentation inference for exceptions. In Proceedings of the 2008 international symposium on Software testing and analysis, ISSTA '08, pages 273–282, New York, NY, USA, 2008. ACM.

[4] R. P. Buse and W. R. Weimer. Automatically documenting program changes. In Proceedings of the IEEE/ACM international conference on Automated software engineering, ASE '10, pages 33–42, New York, NY, USA, 2010. ACM.

[5] W.-K. Chan, H. Cheng, and D. Lo. Searching connected api subgraph via text phrases. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE '12, pages 10:1–10:11, New York, NY, USA, 2012. ACM.

[6] A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella, and S. Panichella. Using ir methods for labeling source code artifacts: Is it worthwhile? In Program Comprehension (ICPC), 2012 IEEE 20th International Conference on, pages 193–202, June 2012.

[7] S. C. B. de Souza, N. Anquetil, and K. M. de Oliveira. A study of the documentation essential to software maintenance. In Proceedings of the 23rd annual international conference on Design of communication: documenting & designing for pervasive information, SIGDOC '05, pages 68–75, New York, NY, USA, 2005. ACM.
[8] E. Duala-Ekoko and M. P. Robillard. Asking and answering questions about unfamiliar apis: an exploratory study. In Proceedings of the 2012 International Conference on Software Engineering, ICSE 2012, pages 266–276, Piscataway, NJ, USA, 2012. IEEE Press.

[9] B. Eddy, J. Robinson, N. Kraft, and J. Carver. Evaluating source code summarization techniques: Replication and expansion. In Proceedings of the 21st International Conference on Program Comprehension, ICPC '13, 2013.

[10] D. S. Eisenberg, J. Stylos, and B. A. Myers. Apatite: a new interface for exploring apis. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 1331–1334, New York, NY, USA, 2010. ACM.

[11] B. Fluri, M. Wursch, and H. C. Gall. Do code and comments co-evolve? on the relation between source code and comment changes. In Proceedings of the 14th Working Conference on Reverse Engineering, WCRE '07, pages 70–79, Washington, DC, USA, 2007. IEEE Computer Society.

[12] A. Forward and T. C. Lethbridge. The relevance of software documentation, tools and technologies: a survey. In Proceedings of the 2002 ACM symposium on Document engineering, DocEng '02, pages 26–33, New York, NY, USA, 2002. ACM.

[13] A. Gatt and E. Reiter. Simplenlg: a realisation engine for practical applications. In Proceedings of the 12th European Workshop on Natural Language Generation, ENLG '09, pages 90–93, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[14] E. Goldberg, N. Driedger, and R. Kittredge. Using natural-language processing to produce weather forecasts. IEEE Expert, 9(2):45–53, April 1994.

[15] S. Haiduc, J. Aponte, L. Moreno, and A. Marcus. On the use of automated text summarization techniques for summarizing source code. In Proceedings of the 2010 17th Working Conference on Reverse Engineering, WCRE '10, pages 35–44, Washington, DC, USA, 2010. IEEE Computer Society.

[16] E. Hill. Integrating Natural Language and Program Structure Information to Improve Software Search and Exploration. PhD thesis, Newark, DE, USA, 2010. AAI3423409.

[17] E. Hill, L. Pollock, and K. Vijay-Shanker. Automatically capturing source code context of nl-queries for software maintenance and reuse. In Proceedings of the 31st International Conference on Software Engineering, ICSE '09, pages 232–242, Washington, DC, USA, 2009. IEEE Computer Society.

[18] R. Holmes and G. C. Murphy. Using structural context to recommend source code examples. In Proceedings of the 27th international conference on Software engineering, ICSE '05, pages 117–125, New York, NY, USA, 2005. ACM.

[19] W. M. Ibrahim, N. Bettenburg, B. Adams, and A. E. Hassan. Controversy corner: On the relationship between comment update practices and software bugs. J. Syst. Softw., 85(10):2293–2304, Oct. 2012.

[20] K. Inoue, R. Yokomori, H. Fujiwara, T. Yamamoto, M. Matsushita, and S. Kusumoto. Component rank: relative significance rank for software component search. In Proceedings of the 25th International Conference on Software Engineering, ICSE '03, pages 14–24, Washington, DC, USA, 2003. IEEE Computer Society.

[21] M. Kajko-Mattsson. A survey of documentation practice within corrective maintenance. Empirical Softw. Engg., 10(1):31–55, Jan. 2005.

[22] T. Karrer, J.-P. Krämer, J. Diehl, B. Hartmann, and J. Borchers. Stacksplorer: call graph navigation helps increasing code maintenance efficiency. In Proceedings of the 24th annual ACM symposium on User interface software and technology, UIST '11, pages 217–224, New York, NY, USA, 2011. ACM.

[23] M. Kim, D. Notkin, D. Grossman, and G. Wilson. Identifying and summarizing systematic code changes via rule inference. IEEE Transactions on Software Engineering, 39(1):45–62, Jan. 2013.

[24] A. J. Ko, B. A. Myers, and H. H. Aung. Six learning barriers in end-user programming systems. In Proceedings of the 2004 IEEE Symposium on Visual Languages - Human Centric Computing, VLHCC '04, pages 199–206, Washington, DC, USA, 2004. IEEE Computer Society.

[25] D. Kramer. Api documentation from source code comments: a case study of javadoc. In Proceedings of the 17th annual international conference on Computer documentation, SIGDOC '99, pages 147–153, New York, NY, USA, 1999. ACM.

[26] J. Krinke. Effects of context on program slicing. J. Syst. Softw., 79(9):1249–1260, Sept. 2006.

[27] A. N. Langville and C. D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, USA, 2006.

[28] T. D. LaToza and B. A. Myers. Developers ask reachability questions. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE '10, pages 185–194, New York, NY, USA, 2010. ACM.

[29] D. Lawrie, C. Morrell, H. Feild, and D. Binkley. What's in a name? a study of identifiers. In 14th International Conference on Program Comprehension, pages 3–12. IEEE Computer Society, 2006.

[30] T. C. Lethbridge, J. Singer, and A. Forward. How software engineers use documentation: The state of the practice. IEEE Softw., 20(6):35–39, Nov. 2003.

[31] S. Mani, R. Catherine, V. S. Sinha, and A. Dubey. Ausum: approach for unsupervised bug report summarization. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE '12, pages 11:1–11:11, New York, NY, USA, 2012. ACM.

[32] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[33] C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu. Portfolio: finding relevant functions and their usage. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 111–120, New York, NY, USA, 2011. ACM.

[34] L. Moreno, J. Aponte, S. Giriprasad, A. Marcus, L. Pollock, and K. Vijay-Shanker. Automatic generation of natural language summaries for java classes. In Proceedings of the 21st International Conference on Program Comprehension, ICPC '13, 2013.
[35] D. T. Morse. Minsize2: A computer program for determining effect size and minimum sample size for statistical significance for univariate, multivariate, and nonparametric tests. Educational and Psychological Measurement, 59(3):518–531, June 1999.

[36] G. C. Murphy. Lightweight structural summarization as an aid to software evolution. PhD thesis, University of Washington, July 1996.

[37] S. Panichella, J. Aponte, M. Di Penta, A. Marcus, and G. Canfora. Mining source code descriptions from developer communications. In Program Comprehension (ICPC), 2012 IEEE 20th International Conference on, pages 63–72, June 2012.

[38] D. Puppin and F. Silvestri. The social network of java classes. In Proceedings of the 2006 ACM symposium on Applied computing, SAC '06, pages 1409–1413, New York, NY, USA, 2006. ACM.

[39] E. Reiter and R. Dale. Building natural language generation systems. Cambridge University Press, New York, NY, USA, 2000.

[40] T. Roehm, R. Tiarks, R. Koschke, and W. Maalej. How do professional developers comprehend software? In Proceedings of the 2012 International Conference on Software Engineering, ICSE 2012, pages 255–265, Piscataway, NJ, USA, 2012. IEEE Press.

[41] L. Shi, H. Zhong, T. Xie, and M. Li. An empirical study on evolution of api documentation. In Proceedings of the 14th international conference on Fundamental approaches to software engineering: part of the joint European conferences on theory and practice of software, FASE'11/ETAPS'11, pages 416–431, Berlin, Heidelberg, 2011. Springer-Verlag.

[42] J. Sillito, G. C. Murphy, and K. De Volder. Asking and answering questions during a programming change task. IEEE Trans. Softw. Eng., 34(4):434–451, July 2008.

[43] S. E. Sim, C. L. A. Clarke, and R. C. Holt. Archetypal source code searches: A survey of software developers and maintainers. In Proceedings of the 6th International Workshop on Program Comprehension, IWPC '98, pages 180–, Washington, DC, USA, 1998. IEEE Computer Society.

[44] M. D. Smucker, J. Allan, and B. Carterette. A comparison of statistical significance tests for information retrieval evaluation. In CIKM, pages 623–632, 2007.

[45] G. Sridhara. Automatic Generation of Descriptive Summary Comments for Methods in Object-oriented Programs. PhD thesis, University of Delaware, Jan. 2012.

[46] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering, ASE '10, pages 43–52, New York, NY, USA, 2010. ACM.

[47] G. Sridhara, L. Pollock, and K. Vijay-Shanker. Automatically detecting and describing high level actions within methods. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 101–110, New York, NY, USA, 2011. ACM.

[48] G. Sridhara, L. Pollock, and K. Vijay-Shanker. Generating parameter comments and integrating with method summaries. In Proceedings of the 2011 IEEE 19th International Conference on Program Comprehension, ICPC '11, pages 71–80, Washington, DC, USA, 2011. IEEE Computer Society.

[49] D. Steidl, B. Hummel, and E. Juergens. Quality analysis of source code comments. In Proceedings of the 21st International Conference on Program Comprehension, ICPC '13, 2013.

[50] J. Stylos and B. A. Myers. Mica: A web-search tool for finding api components and examples. In Proceedings of the Visual Languages and Human-Centric Computing, VLHCC '06, pages 195–202, Washington, DC, USA, 2006. IEEE Computer Society.

[51] J. Stylos, B. A. Myers, and Z. Yang. Jadeite: improving api documentation using usage information. In CHI '09 Extended Abstracts on Human Factors in Computing Systems, CHI EA '09, pages 4429–4434, New York, NY, USA, 2009. ACM.

[52] A. A. Takang, P. A. Grubb, and R. D. Macredie. The Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Study. Journal of Programming Languages, 4(3):143–167, 1996.

[53] Y. Tao, Y. Dang, T. Xie, D. Zhang, and S. Kim. How do software engineers understand code changes?: an exploratory study in industry. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE '12, pages 51:1–51:11, New York, NY, USA, 2012. ACM.

[54] D. van Heesch. Doxygen website, 2013.

[55] A. T. T. Ying and M. P. Robillard. Code fragment summarization. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 655–658, New York, NY, USA, 2013. ACM.

[56] S. Zhang, C. Zhang, and M. D. Ernst. Automated documentation inference to explain failed tests. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, ASE '11, pages 63–72, Washington, DC, USA, 2011. IEEE Computer Society.