Using_Artificial_Intelligence_in_Source_Code_Summa
1. Introduction
Shraddha Birari, Computer Science and Engineering, MIT World Peace University, Pune, India
Email: [email protected].
S. Birari and S. Bhingarkar / Using Artificial Intelligence in Source Code Summarization 257
written should follow the same format throughout the whole document.
In this paper we present a review of various source code summarization
methodologies implemented so far using different natural language and machine
learning techniques.
This section gives a brief idea of the different use cases in which source code
summaries or written comments are useful.
A software document is an artifact that communicates information about the
software system to the people involved in producing that software. These people
include customers, developers, project leaders, and managers [7].
According to a survey of the problems faced by developers [8], 66% of developers
faced challenges in understanding the motive or purpose of a piece of code.
Understanding code written by someone else was rated a serious problem by 56% of
developers, and 17% of developers find it difficult to understand their own code
written just a while ago. This shows that properly written documentation of the
source code is important in software maintenance.
Code search is a common activity during software development. According to a
survey of developers regarding code search [11], 33.5% of people search in order
to refer to example code, 26% read or explore the code, 16% search for code
localization, such as “where a class is instantiated”, 16% refer to the code to
determine impact, such as “understanding the dependencies” or “finding side
effects of proposed changes”, and 8.5% look for metadata, such as “who recently
touched the code”. Also, according to the same survey, a developer creates on
average 12 search queries per working day. This shows that searching code is a
frequent and highly important activity. The performance of code search depends
heavily on the text in the search term and in the code snippets [12]. Code
search is difficult when the search term
specified as input does not contain the same words as the corresponding source code.
Thus, well-written comments lead to more effective code search.
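The vocabulary mismatch described above can be illustrated with a minimal sketch. It assumes a naive token-overlap score as a stand-in for a real code search engine; the function names, the snippet, and the query are hypothetical and do not come from the surveyed work.

```python
import re

def tokens(text):
    """Lowercase word tokens; camelCase identifiers are split first."""
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(query, document):
    """Fraction of query tokens that also occur in the document."""
    q = tokens(query)
    return len(q & tokens(document)) / len(q) if q else 0.0

# Hypothetical snippet whose identifiers share no words with the query.
code = "def f(xs):\n    return sorted(xs)[len(xs) // 2]"
comment = "# Return the median value of a list of numbers."
query = "median of list"

bare_score = overlap_score(query, code)                        # code only
commented_score = overlap_score(query, comment + "\n" + code)  # comment + code
```

Under this toy score the uncommented snippet shares no tokens with the query, while the commented version matches every query token.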
As discussed earlier, source code summarization is the task of generating comments
from a given code snippet. In this section, we present some existing techniques
implemented for source code summarization, including encoder-decoder models,
language models, and reinforcement learning.
value generated by the actor network at each time step. In this work, the evaluation
score, i.e., Bilingual Evaluation Understudy (BLEU), is defined as the reward.
The proposed work effectively captures the semantics of the code snippet through
the use of the AST. Also, the actor-critic network predicts the summary, which
resolves the exposure bias issue. But this model is evaluated only on certain Python
code and comment pairs, which may not represent all types of comments, and it is
not generalized to other programming languages. Also, only the BLEU score is used
to calculate the reward, which may not satisfy human evaluation criteria.
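To make the idea of a BLEU-based reward concrete, the following is a minimal sketch of a simplified sentence-level BLEU. The add-one smoothing, the whitespace tokenization, and the name `bleu_reward` are assumptions made for illustration; the exact BLEU variant used in the reviewed work is not specified here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_reward(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with add-one smoothing, in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((matches + 1) / (total + 1)) / max_n
    # Brevity penalty discourages summaries shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)

reference = "return the sum of two numbers"
good = bleu_reward("return sum of two numbers", reference)
bad = bleu_reward("open a file", reference)
```

A generated summary that overlaps more with the reference comment receives a higher reward, which is also why a purely n-gram-based reward may diverge from human judgment.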
PBMT is not able to handle the hierarchical correspondence of the source code, and
thus Tree to String Machine Translation (T2SMT) is useful. It makes use of the parse
tree of the input sentence. In this technique, instead of a phrase pair, a
“derivation” is introduced, which represents the relationship between a source
subtree and a target phrase.
The above approach is based on rule-based machine translation; thus, it can be
generalized to a variety of languages by applying the corresponding rules. But this
methodology performs sentence-wise translation, due to which it is unable to handle
multiple statements.
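The mapping from a source subtree to a target phrase can be sketched as follows. This toy example uses Python's standard `ast` module and a handful of hand-written rules; the rules and the `describe` function are illustrative assumptions and only stand in for the derivations that an actual T2SMT system would obtain from its training data.

```python
import ast

def describe(node):
    """Translate one AST subtree into an English phrase via hand-written rules."""
    if isinstance(node, ast.If):
        return f"if {describe(node.test)}, {describe(node.body[0])}"
    if isinstance(node, ast.Compare) and isinstance(node.ops[0], ast.Gt):
        return f"{describe(node.left)} is greater than {describe(node.comparators[0])}"
    if isinstance(node, ast.Assign):
        return f"set {describe(node.targets[0])} to {describe(node.value)}"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return "<unknown>"  # no rule covers this subtree

tree = ast.parse("if x > 0:\n    y = 1")
summary = describe(tree.body[0])  # translates a single statement only
```

Each rule pairs a subtree shape with a phrase template, and the phrase for a parent is assembled from the phrases of its children. As in the limitation noted above, the sketch translates one statement at a time.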
Natural language generation is a subfield of AI which translates given data into a
natural language such as English. In [15], a source code summarization technique is
presented which generates English descriptions for given Java code. This approach
works by analyzing how a method is invoked. The PageRank algorithm is used to find
the most important methods in the most important context. Then the Software Word
Usage Model (SWUM) identifies keywords from the actions performed by the important
methods.
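As an illustration of ranking methods by importance, the sketch below runs a plain power-iteration PageRank over a small call graph. The graph, the method names, and the damping factor of 0.85 are assumptions for the example, not details taken from [15].

```python
def pagerank(call_graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {caller: [callees]}."""
    nodes = set(call_graph) | {c for cs in call_graph.values() for c in cs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            callees = call_graph.get(n, [])
            # A method that calls nothing spreads its rank evenly over all nodes.
            targets = callees if callees else list(nodes)
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical call graph: an edge caller -> callee means "caller invokes callee".
call_graph = {
    "main": ["parse", "render"],
    "parse": ["read_file"],
    "render": ["read_file"],
    "read_file": [],
}
ranks = pagerank(call_graph)
most_important = max(ranks, key=ranks.get)
```

Here `read_file` ranks highest because it is called, directly or indirectly, by every other method in the graph.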
Finally, a custom Natural Language Generation (NLG) technique is used to generate
English descriptions of what the methods actually do. In this NLG technique, six
different types of messages are first created, which represent different contexts of
the methods. Table 1 shows the different types of messages and their corresponding
explanations.
Message type             Explanation
Quick Summary Message    A short sentence that describes the function.
Return Message           Gives the return type of the method.
Importance Message       Shows the importance of the method based on PageRank.
Output Used Message      Describes at most two methods that call this method.
Call Message             Describes at most two methods that are called by this method.
The next step is lexicalization, in which phrases are chosen to describe each message
according to its type. After lexicalization, more readable phrases are generated in
the aggregation step. Finally, sentences are generated from the phrases produced by
the previous step.
The above approach makes use of context-specific information, due to which it
provides meaningful, situation-based (contextual) output.
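The message, lexicalization, aggregation, and sentence generation steps can be sketched as a small template pipeline. The message fields, the templates, and the method name `readFile` are hypothetical and chosen only for illustration; the templates used in [15] are not reproduced here.

```python
# Hypothetical message records for one method; field names are illustrative.
messages = [
    {"type": "quick_summary", "action": "reads", "object": "a file"},
    {"type": "return", "return_type": "String"},
    {"type": "call", "callees": ["openStream", "closeStream", "log"]},
]

def lexicalize(msg):
    """Map one message to a phrase using a template chosen by message type."""
    if msg["type"] == "quick_summary":
        return f"{msg['action']} {msg['object']}"
    if msg["type"] == "return":
        return f"returns a {msg['return_type']}"
    if msg["type"] == "call":
        return "calls " + " and ".join(msg["callees"][:2])  # at most two methods
    raise ValueError(f"unknown message type: {msg['type']}")

def aggregate(phrases):
    """Merge phrases that share the same subject into one clause list."""
    if len(phrases) == 1:
        return phrases[0]
    return ", ".join(phrases[:-1]) + ", and " + phrases[-1]

def realize(method, phrases):
    """Generate the final sentence from the aggregated phrases."""
    return f"The method {method} {aggregate(phrases)}."

sentence = realize("readFile", [lexicalize(m) for m in messages])
```

Keeping the stages separate means that a new message type only needs a new template in `lexicalize`, while aggregation and sentence generation stay unchanged.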
3.4. Tree Convolutional Neural Network (Tree CNN) based source code summarization.
A convolutional layer in a neural network helps to extract important features from
the input. In [16], a Tree CNN is utilized in which the program’s structure is captured using an
Abstract Syntax Tree (AST). An AST captures the syntactic structure of the
language using a hierarchical representation. Each component in the program is
represented as an AST node, and each node is denoted as a vector based on a coding
criterion. Then the convolutional layer detects the structural features of the
program. The new tree generated after convolution has the same size and shape as the
original one, so it varies from program to program and cannot be fed directly to a
fixed-size layer. To solve this issue, dynamic pooling is utilized. This approach
classifies programs according to their functionality.
The above work captures the meaning of the code snippet through its hierarchical AST
representation. But during training it is exposed only to the training data, which
may cause it to suffer from exposure bias.
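The role of dynamic pooling can be illustrated with a heavily simplified sketch. It uses Python's `ast` module to obtain the tree, fixed pseudo-embeddings in place of learned node vectors, and scalar weights in place of learned weight matrices; all of these are assumptions for illustration and do not reproduce the model in [16].

```python
import ast
import math

DIM = 3  # length of every node vector (illustrative)

def embed(node):
    """Hypothetical fixed embedding derived from the node type name."""
    name = type(node).__name__
    return [math.sin(len(name) * (i + 1)) for i in range(DIM)]

def convolve(node, w_parent=0.6, w_child=0.4):
    """One convolution window: a node combined with the mean of its children."""
    x = embed(node)
    child_vecs = [embed(c) for c in ast.iter_child_nodes(node)]
    if child_vecs:
        mean = [sum(v[i] for v in child_vecs) / len(child_vecs) for i in range(DIM)]
    else:
        mean = [0.0] * DIM
    return [math.tanh(w_parent * x[i] + w_child * mean[i]) for i in range(DIM)]

def encode(source):
    """Convolve every AST node, then max-pool over all nodes."""
    features = [convolve(n) for n in ast.walk(ast.parse(source))]
    # One feature vector per node, so the count varies with program size;
    # pooling over nodes gives a fixed-length vector for any tree.
    return [max(f[i] for f in features) for i in range(DIM)]

small = encode("x = 1")
large = encode("def f(a, b):\n    if a > b:\n        return a\n    return b")
```

Both programs yield a vector of the same length even though their trees differ in size, which is what allows a fixed-size classification layer to follow.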
Source code Markup Language (SrcML) converts source code in various programming
languages, such as Java, C, and C++, to an XML file [17]. In the proposed work [18],
the input is the XML file for the given code snippet and the output is a document
file. This target document file is a combination of several parts, each of which
describes an important part of the code snippet. SrcML considers numerous factors,
such as white space, classes, parameters, and conditions, and generates the XML file
accordingly. The main components of the source code are represented with XML tags
such as <class>, <function>, and <loop>. A feature extractor fetches the data from
the XML file. Using XPath, queries are performed to identify each object in the code
snippet and extract four features: attributes, conditions, calls, and functions. The
variables or parameters in the source code are identified as attributes; the tag
<decl> is used to fetch them. Conditions include “if” conditions as well as several
types of loops. The calls made by the source code are grouped into calls and are
represented with the tag <call>. The functions, with the name and the data type of
the returned value, are represented with the tag <function>. The feature extractor
generates a program structure information file that can be used by the code
description generator. The code description generator reads two files, the source
code and the program structure information file, and then generates comments based
on the source code and the related information.
The above model makes use of tags that focus on the core components of the code, due
to which complete and clear comments can be generated. Although it focuses on the
main concepts of the code, it does not show the inheritance relationships among
classes. Also, the model considers only the object-oriented aspects of the
programming language, and thus it is useful only for object-oriented languages.
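The feature extraction step can be sketched with XPath-style queries from Python's standard `xml.etree.ElementTree` module. The XML fragment below is hand-written and simplified; real SrcML output uses XML namespaces and a richer tag set, so the markup, the `extract_features` function, and all identifiers in it are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified SrcML-like markup (an assumption, not real tool output).
SRC_XML = """
<unit>
  <class><name>Counter</name>
    <function><type>int</type><name>increment</name>
      <decl><type>int</type><name>step</name></decl>
      <if><condition>step &gt; 0</condition></if>
      <call><name>log</name></call>
      <call><name>save</name></call>
    </function>
  </class>
</unit>
"""

def extract_features(xml_text):
    """Collect the four feature groups with ElementTree's XPath subset."""
    root = ET.fromstring(xml_text.strip())
    return {
        "attributes": [d.findtext("name") for d in root.findall(".//decl")],
        "conditions": [c.text for c in root.findall(".//condition")],
        "calls": [c.findtext("name") for c in root.findall(".//call")],
        "functions": [(f.findtext("type"), f.findtext("name"))
                      for f in root.findall(".//function")],
    }

features = extract_features(SRC_XML)
```

The resulting dictionary plays the role of the program structure information that a description generator would read alongside the source code.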
4. Conclusion
Source code summarization generates descriptions from the source code that describe
what the source code is intended to do. A summary of the source code is useful for
software maintenance, code search, and code categorization.
Most of the existing source code summarization techniques were unable to capture
the structure of the source code and faced the exposure bias issue. Also, it is
necessary to build a source code summarization technique that also satisfies human
evaluation criteria. In future, Generative Adversarial Networks (GANs), in which a
generator and a discriminator are designed jointly to generate the summary, can be
helpful in dealing with the exposure bias issue.
References