XML Framework For Language Neutral Representation

This document discusses an XML-based framework for language-neutral program representation and generic analysis. It proposes representing programs at different levels of abstraction using XML, including abstract syntax trees (ASTs), intraprocedural flow and dependence graphs. While existing representations are tightly coupled to programming languages, this framework aims to facilitate the development of generic analysis tools by combining language-neutral representations.

An XML-Based Framework for Language Neutral Program Representation and Generic Analysis

Conference Paper in Proceedings of the Euromicro Conference on Software Maintenance and Reengineering, CSMR · April 2005
DOI: 10.1109/CSMR.2005.10 · Source: IEEE Xplore




Raihan Al-Ekram and Kostas Kontogiannis


Dept. of Electrical and Computer Engineering
University of Waterloo
Waterloo, Ontario, Canada
Email: {rekram | kostas}@swen.uwaterloo.ca

Abstract

XML applications are becoming increasingly popular for defining structured or semi-structured constrained data in XML for special application areas. In this pursuit there is a growing momentum of activities related to XML representation of source code in the area of program comprehension and software re-engineering. The source code and the artifacts extracted from a program are necessarily structured information that needs to be stored and exchanged among different tools. This makes XML a natural choice as the external format for program representations. Most of the XML representations proposed so far abstract the source code at the AST level. These AST representations are tightly coupled with the language grammar of the source code and hence require development of different tools for different programming languages to perform the same type of analysis. Moreover, the AST abstracts the program at a very fine level of granularity and hence is not suitable to be used directly for higher-level sophisticated program analysis. As such, we propose XML applications for language neutral representation of programs at different levels of abstraction, and by combining them we present a program representation framework that facilitates the development of generic program analysis tools.

1. Introduction

The Extensible Markup Language (XML) [20], a World Wide Web Consortium (W3C) [21] standard, has been widely accepted for storing and exchanging structured and semi-structured documents. Many XML sublanguages have been developed to define constrained data in XML format for special application areas, often by means of a Document Type Definition (DTD) or XML Schema [22] definition. For example, the Mathematical Markup Language (MathML) [23] is defined for electronic interchange of mathematical symbols, equations and formulae, and the Voice Extensible Markup Language (VoiceXML) [24] is developed for voice markup and telephony call control to enable access to the Web using spoken interaction. Such markup languages are becoming increasingly popular because XML is simple, easy to understand, extensible, searchable, an open standard and interoperable, and there is a wide range of tool support for automatic creation, manipulation and transformation of XML documents.

In this pursuit there is a growing momentum of activities related to XML representation of source code in the area of program comprehension and software re-engineering. Various XML applications, namely JavaML [8] [12], CppML [12], srcML [9], PLIXML [10] and PascalML [10], have been proposed to represent source code written in different programming languages. Some of them represent the complete AST of the source code while the others produce a partial AST representation by partially marking up the source at the focal point of analysis. Some of them are just syntax preserving, whereas the others preserve non-syntactic lexical information as well. But all of the AST representations are tightly coupled with the language grammar of the source code and hence require development of different tools for different programming languages to perform the same type of analysis. Moreover, the AST abstracts the program at a very fine level of granularity and hence is not suitable to be used directly for higher-level sophisticated program analysis. As such, in this paper we propose XML applications for language neutral representation of programs at different levels of abstraction, and by combining them we present a program representation framework that facilitates the development of generic program analysis tools.
The rest of the paper is organized as follows: Section 2 provides background information on different program representation formalisms at different levels of granularity and related work in representing them using XML. Section 3 presents language neutral AST representations based on generic language models. Section 4 discusses the proposed XML applications to represent the higher-level artifacts. Section 5 presents the representation framework that will facilitate the development of generic program analysis tools. Section 6 describes a prototype implementation of the framework. Finally Section 7 concludes the paper.

2. Background and Related Work

In this section we discuss different source code representation formalisms and some higher-level abstractions of source code that focus on different aspects of a program. We also investigate the existing XML based external formats for storing and exchanging these program representations.

2.1 Program Representation Formalisms

While the source code is the original artifact of a software system, it is written and stored in ASCII plain text format and is not suitable to be used directly for sophisticated program analysis. More structured and abstract representations are needed to enable algorithmic analysis and manipulation of programs. So the source code needs to be represented at different levels of granularity.

2.1.1 Syntax Trees

A Parse Tree [1] is a hierarchical graphical representation of the derivations of the source code from its grammar. The interior nodes of the tree represent the non-terminals and the leaves represent the terminal symbols of the grammar. An Abstract Syntax Tree (AST) [1] is a more economical representation of the source code that abstracts out the redundant grammar productions from the parse tree. The source sentence can be reconstructed from a depth-first inorder traversal of the tree nodes.

The syntax trees are the basic source code representations at the finest level of granularity. These data structures are used by compilers to analyze and transform source code entities. They also serve as the primary input for source code analysis and for constructing other representations for higher-level program analysis. The syntax trees are the abstraction of the source code in terms of the language grammar and hence are heavily dependent on the programming language.

2.1.2 Intra-procedural Flow and Dependence Graphs

The next higher-level abstractions of source code are the flow and dependence graphs. These graph data structures are abstractions in terms of control flow and data flow of the program and can be represented in a programming language independent way. The intra-procedural graphs represent a single subroutine, procedure or function within a program.

A Control Flow Graph (CFG) [2] provides a normalized view of all possible execution paths of a program. A CFG is a rooted directed graph showing the basic blocks in a program and the possible immediate transfer of control from one basic block to another. The CFG representation is extensively used for data flow analysis, code optimization and testing.

A Program Dependence Graph (PDG) [3] is a combined explicit representation of both control and data dependences in a program. The PDG is also a rooted directed graph that consists of nodes representing the statements and predicate expressions in the program and edges connecting them representing the control and data dependences between them. The control dependence edges are labeled either True or the truth-value of the predicate and the data dependence edges are labeled by the variable name that causes possible flow of data values between the nodes. The PDG is used for code optimization, parallelism detection, loop fusion, clone detection etc. It is also used for performing slicing for maintenance and re-engineering purposes.

2.1.3 Inter-Procedural Flow and Dependence Graphs

Understanding the flow of information within a single subroutine is not sufficient for optimization or analysis of the complete system, which is comprised of many procedures and files.

The System Dependence Graph (SDG) [4] is an extension to the PDG for programs with multiple procedures. The SDG is constructed by connecting the individual PDG of each procedure with some additional edge types that correspond to procedure calls, parameters passed and return values.

Call Graphs [5] [6] are program abstractions used in traditional inter-procedural analysis. A call graph is a graphical representation of the caller and callee relationships among the procedures of a program, where the nodes indicate the procedures and the arcs indicate the calls. The nodes and arcs in a call graph may also contain attribute labels (e.g. line number of the call or file name of the procedure) to enhance the graph with additional information. There can optionally be new
entities in the graph (e.g. abstract data types and their usage relationships) in addition to the procedure calls. An extension to the call graph is the Program Summary Graph (PSG) that takes into account the reference parameters and global variables at the individual call points.

From the basic graph, higher level call graphs can be constructed to show relationships among files, modules or architectural entities instead of procedures. Other than inter-procedural data flow analysis for optimization, call graphs are also used for design recovery, architecture extraction or other reverse engineering analysis.

2.2 Program Representations using XML

Simic and Tolnik [7] explore the prospects of representing source code using XML in place of the classical plain text format. They demonstrate that an XML grammar can improve the code structure, formatting and querying possibilities, and will allow making orthogonal extensions to code for annotations, revision control, access control and documentation.

There is a spectrum of levels of granularity at which source code is represented. Among them the AST representation provides the most detailed information from the source code. Hence most of the XML applications for source representation proposed so far are based on the AST notation of a program.

2.2.1 Java Markup Language (JavaML)

Badros [8] proposes an XML application, namely the Java Markup Language (JavaML), to represent Java source code in terms of its AST in order to help tools perform software engineering analysis by leveraging the abundance of XML tools and technologies. JavaML is defined by an XML DTD, where the elements represent the structure of the AST and most of the source code information is stored as attributes on the element tags.

JavaML is a complete syntactic representation of the AST and hence the formatting and other lexical information in the source code are not preserved. In addition to representing the syntax of the source code, JavaML stores some semantic information as well. For example, IDREF tags are used to refer to the declaration of a variable from the locations where it is used, which can be used for scope resolution or for getting the type of a variable easily.

2.2.2 Source Code Markup Language (srcML)

Collard et al. [9] describe a technique to convert C++ source code into an XML representation, namely the Source Code Markup Language (srcML), in order to use it for static extraction of facts. This is a markup technique where the tags are superimposed on the source code, keeping the original code as it is. The markups explicitly describe the internal structure of the code while preserving the comments and the formatting information. srcML is defined by an XML DTD constructed from the C++ language grammar.

srcML allows incomplete parsing of the source code to generate a partial AST by using a multi-pass multi-stage parsing technique with a partial grammar specification. This enables controlling the parsing up to the desired level of interest depending on the focus of the analysis. This approach of parsing, marking up only the selected constructs of interest while leaving the others as they are, is known as island parsing.

2.2.3 XMLizer

McArthur et al. [10] present the XMLizer tool to transform source code of several programming languages into their respective XML representations in order to facilitate re-engineering and migration. The PL/IX Markup Language (PLIXML), the Pascal Markup Language (PascalML) and the Java Markup Language (JavaML) are defined with their own DTDs to represent PL/IX, Pascal and Java source code respectively. XMLizer uses a multi-weight parser that can generate ASTs of variable granularity by allowing designated syntactic constructs to remain unparsed. This allows preserving certain lexical information, e.g. comments, by attaching them to unparsed constructs.

2.2.4 Agile Parsing

Cordy [11] describes a method for extending and generalizing the partial markup idea of island or multi-weight parsing using the agile parsing technique of TXL [19]. This approach selectively marks up only those AST nodes in the source that are relevant to a particular analysis task. Using grammar overrides and utilizing TXL's ordered ambiguity resolution, a very precise form of constructs can be specified for markup, without any modification in the base grammar.

This parsing technique is programming language independent and has been used with grammars for Java, C++, COBOL, PL/I and RPG. There are no DTDs defined for the markups; the non-terminal symbols of the grammar of the languages are used as the markup tags. As a result the markups are still strongly coupled with the respective language grammar.

2.2.5 Graph Exchange Language (GXL)

The Graph Exchange Language (GXL) [17] [18] is an XML based language for describing graphs. It
evolved from unification of other existing graph description languages. Unlike the other representations discussed, GXL was not originally intended to represent source code. Hence there is no schema defined in GXL to represent any software artifact. Instead, it provides features to specify the schema for the data as well as the data itself in the same format. The higher-level program representation formalisms, being graphs in nature, make GXL a good candidate for their representation.

3. Modeling Programming Languages

Even though the AST is the fundamental source code representation formalism for building software analysis tools, AST representations are strongly tied to the corresponding language grammars. This requires the development of different tools to perform the same type of analysis for programs written in different programming languages. To enable portability of the representations and the building of generic tools, the AST representations should be decoupled from the language grammars.

Over a family of programming languages the key concepts remain the same and they share many common features. For example, the object-oriented languages Java and C++ both have the notion of class, method/function, inheritance etc. Hence it is possible to develop a generic model of object-oriented languages by studying the grammars and a) identifying the commonalities and obtaining a generalization and b) identifying the variabilities and aggregating them at a higher level of abstraction. The AST representations based on the generic model will be able to handle constructs and represent source code from various object-oriented languages in a uniform format. Tools built on the generic format, e.g. a tool to extract the object model, will be able to analyze programs written in any object-oriented language. The same argument holds for the family of procedural languages and so on.

3.1 Generic Procedural Model

Zou and Kontogiannis [14] [15] [16] proposed a generic model and an XML application, the Procedural Markup Language (ProcML), for representing procedural languages in XML. Their proposed model is derived from programming languages like C, Fortran, Pascal and COBOL.

In the first step the AST representations of individual languages are modeled using UML. The classes in the UML model encode the AST nodes, which are the basic language constructs and the attributes gathered in them. The class associations represent the attributes of non-primitive language syntactic types.

The second step is to identify the functionally equivalent constructs in different languages and generalize them at a higher level of abstraction. For example, subroutine in Fortran and function in C denote similar concepts that can be generalized as a unique term, procedure. Figure 1 presents a part of their proposed generic model for procedural languages.

Figure 1: Generic Procedural Language Model
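
The generalization step can be illustrated with a minimal sketch. It only shows the idea of renaming functionally equivalent constructs to one generic term; it is not the authors' implementation. The element names Subroutine, Function and Procedure are hypothetical, since the actual FortranML, CML and ProcML tags are not listed in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names: the real FortranML, CML and ProcML element
# names are not given in the paper, so this mapping is an assumption.
TO_GENERIC = {
    "Subroutine": "Procedure",  # Fortran-style construct
    "Function": "Procedure",    # C-style construct
}

def generalize(root):
    """Rename language-specific elements to their generic equivalents, in place."""
    for elem in root.iter():
        elem.tag = TO_GENERIC.get(elem.tag, elem.tag)
    return root

# Two tiny language-specific ASTs describing the same routine.
fortran_ast = ET.fromstring('<Program><Subroutine name="swap"/></Program>')
c_ast = ET.fromstring('<Program><Function name="swap"/></Program>')

generic_from_fortran = ET.tostring(generalize(fortran_ast), encoding="unicode")
generic_from_c = ET.tostring(generalize(c_ast), encoding="unicode")
```

After generalization both inputs serialize to the same generic document, which is what allows a single tool to process either language.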


The UML diagram is a graphical representation of the model. In the third step, for storage and model interchange, the UML models are encoded in XML DTD definitions. This results in one DTD for each of the languages (CML, FortranML and PascalML, corresponding to the C, Fortran and Pascal languages) and one for the generic model, ProcML. The produced DTDs are effectively the models, and the XML representations of the ASTs are instances of them that will be validated against the models.

The generation of XML files representing the ASTs works as follows: a) XML ASTs for individual languages are generated in conformance to their own language model DTDs and b) XML ASTs for specific language models are transformed to the generic model using XSLT mapping programs.

3.2 Generic Object-Oriented Model

Mamas and Kontogiannis [12] [13] proposed an XML application, the Object-Oriented Markup Language (OOML), as a generic model for representing object-oriented programming languages. OOML is derived by generalizing JavaML and CppML, the language models for the Java and C++ languages respectively. Table 1 lists some of the mappings from JavaML and CppML entities to OOML entities.

Table 1: JavaML, CppML to OOML Mapping

JavaML              CppML                   OOML
CompilationUnit     Program                 Program
ImportDeclaration   Include                 Include
ClassDeclaration    Class                   Class
MethodDeclaration   Function                Method
FieldDeclaration    Variable                VariableDeclaration
Block               LexicalBlockStatement   Body
SwitchStatement     SwitchStatement         ConditionalStatement
IfStatement         IfStatement             ConditionalStatement
DoStatement         DoStatement             Loop
ForStatement        ForStatement            Loop
WhileStatement      WhileStatement          Loop

4. Modeling Higher Level Artifacts

While the AST level representations are useful for some types of analysis, they are not usable for sophisticated higher-level analysis. For example, in order to perform data flow analysis on a program the CFG representation of the program is required, or in order to perform design recovery for a software system the call graph from the source programs is required. But the existing XML applications lack in defining program representations for abstractions at a level higher than the AST. The higher-level program abstractions are the intra and inter procedural flow and dependence graphs of a program. Among them the most commonly used representations in program analysis are the CFG, PDG and Call Graph. As part of this research we propose the XML applications CFGML, PDGML and CGML to represent these graph data structures respectively.

For each of the higher-level representations we first identified the basic elements that constitute the representation and the relationships among these elements. Based on this we developed a UML model for each of the representations. In doing so we realized that all these representations share some common elements. The common elements are the basic building block constructs of a program and the relationships among them. These constructs are statements, variables, data types, functions etc. and the relationships are the uses/definitions/declarations of the variables in the statements, declarations of/calls to the functions etc. We call these constructs and relationships the Facts. The higher-level representations use the Facts and define new constructs and relationships, specific to the particular representation, on top of them.

4.1 FactML

The first step is to develop a UML model for the program Facts. The building block constructs of the Facts are represented as classes and the relationships among them are shown as associations or association classes. This results in classes named Type, Variable, Statement, and Function in the model. Each member of the Variable class is associated with a member of the Type class by its data type. A Variable and a Statement are related by a declaration relationship that is a simple association, whereas uses and definitions of a Variable in a Statement are more complex and require an association class. There can be three different relationships between a Statement and a Function: a Function is declared in a Statement, a Function consists of many Statements and a Statement can call one or more Functions. Figure 2 presents the complete UML model of the Facts.

In the second step the UML model is transformed into an XML DTD declaration using the following production rules:

- Classes are mapped as elements
- Attributes of the classes are mapped as attributes in the elements
- Aggregations are mapped as sub-elements separated by or (|)
- Inheritances are mapped as sub-elements
- Simple associations are shown by IDREFs
- Association classes are mapped as elements showing the associations by IDREFs
- Elements with the same tags originating from the same node are grouped as sub-elements under one bigger element.

Figure 3 presents a part of the DTD derived from the UML model. For the classes Statement and Variable in the UML model there is one element each in the DTD. Collections of them are grouped under the bigger elements Statements and Variables. The optional IDREF attribute function in the Statement element refers to the Function element the statement is part of, and the IDREF attribute declared in Variable refers to the Statement the variable is declared in. The association class UseDef is mapped to its own element and grouped under a single UseDefs element.

[Figure 2 diagram: class Facts (id, program) with the classes Type (id, name, category), Variable (id, name, scope), Statement (id, lineno, tag) and Function (id, name, scope, signature); Declaration associations to Statement; association classes UseDef (id, category) and Call (id).]

Figure 2: UML Model for Facts

    …
    <!ELEMENT Statements (Statement*)>
    <!ELEMENT Statement EMPTY>
    <!ATTLIST Statement
        id        ID     #REQUIRED
        lineno    CDATA  #REQUIRED
        tag       CDATA  #IMPLIED
        function  IDREF  #IMPLIED>
    <!ELEMENT Variables (Variable*)>
    <!ELEMENT Variable EMPTY>
    <!ATTLIST Variable
        id        ID     #REQUIRED
        name      CDATA  #REQUIRED
        scope     (Local|Global|Param|Ext) "Local"
        declared  IDREF  #IMPLIED
        type      IDREF  #IMPLIED>
    <!ELEMENT UseDefs (UseDef*)>
    <!ELEMENT UseDef EMPTY>
    <!ATTLIST UseDef
        id        ID     #REQUIRED
        category  (Use|Def) "Use"
        statement IDREF  #REQUIRED
        variable  IDREF  #REQUIRED>
    …

Figure 3: DTD for Facts, FactML

4.2 CFGML

A CFG is a directed graph indicating the basic blocks in a program and the possible flows of control from one basic block to another. A basic block contains a sequence of consecutive program statements. The UML model, and hence the XML DTD presented in Figure 4, describes these basic blocks and the flow of control among them. The description of any basic building block construct, e.g. Statement, is linked from the FactML using XLink [27].

Figure 5 shows an example C program and Figure 6 shows the corresponding CFG of the program as an instance of the CFGML. The FactML instance of the program is assumed to be stored as Facts.xml.

    <!ELEMENT CFG (Blocks?, Flows?)>
    <!ATTLIST CFG
        program CDATA #IMPLIED
        scope   CDATA #IMPLIED>
    <!ELEMENT Blocks (Block*)>
    <!ELEMENT Block (Statement*)>
    <!ATTLIST Block
        id    ID    #REQUIRED
        label CDATA #REQUIRED>
    <!ELEMENT Statement EMPTY>
    <!ATTLIST Statement
        id         ID       #REQUIRED
        xlink:type (simple) #FIXED "simple"
        xlink:href CDATA    #REQUIRED>
    <!ELEMENT Flows (Flow*)>
    <!ELEMENT Flow EMPTY>
    <!ATTLIST Flow
        id   ID    #REQUIRED
        from IDREF #REQUIRED
        to   IDREF #REQUIRED>

Figure 4: DTD for CFG, CFGML

    <1> main ()
    <2> {
    <3>   int a = 0;
    <4>   if (a>3)
    <5>     a = a+3;
    <6>   a = 10;
        }

Figure 5: An Example C Program

    <CFG>
      <Blocks>
        <Block id="1" label="1">
          <Statement id="2" xlink:href="Facts.xml#3"/>
          <Statement id="3" xlink:href="Facts.xml#4"/>
        </Block>
        <Block id="4" label="2">
          <Statement id="5" xlink:href="Facts.xml#5"/>
        </Block>
        <Block id="6" label="3">
          <Statement id="7" xlink:href="Facts.xml#6"/>
        </Block>
      </Blocks>
      <Flows>
        <Flow id="8" from="1" to="4"/>
        <Flow id="9" from="1" to="6"/>
        <Flow id="10" from="4" to="6"/>
      </Flows>
    </CFG>

Figure 6: CFGML instance of the C Program
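
To show how a generic tool might consume such an instance, the following is a minimal sketch that loads the CFG of Figure 6 and enumerates its execution paths. It assumes a well-formed copy of the instance with quoted attribute values and a declared xlink namespace; the helper names load_cfg and all_paths are illustrative and not part of the framework.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Well-formed copy of the Figure 6 instance (quoting and the namespace
# declaration are assumptions needed to make it parseable).
CFGML = f"""
<CFG xmlns:xlink="{XLINK}">
  <Blocks>
    <Block id="1" label="1">
      <Statement id="2" xlink:href="Facts.xml#3"/>
      <Statement id="3" xlink:href="Facts.xml#4"/>
    </Block>
    <Block id="4" label="2">
      <Statement id="5" xlink:href="Facts.xml#5"/>
    </Block>
    <Block id="6" label="3">
      <Statement id="7" xlink:href="Facts.xml#6"/>
    </Block>
  </Blocks>
  <Flows>
    <Flow id="8" from="1" to="4"/>
    <Flow id="9" from="1" to="6"/>
    <Flow id="10" from="4" to="6"/>
  </Flows>
</CFG>
"""

def load_cfg(xml_text):
    """Return (blocks, successors): statement links per block and flow edges."""
    root = ET.fromstring(xml_text)
    blocks = {}
    for block in root.iter("Block"):
        links = [s.get(f"{{{XLINK}}}href") for s in block.iter("Statement")]
        blocks[block.get("id")] = links
    successors = {block_id: [] for block_id in blocks}
    for flow in root.iter("Flow"):
        successors[flow.get("from")].append(flow.get("to"))
    return blocks, successors

def all_paths(successors, start, end, prefix=()):
    """Enumerate acyclic paths from start to end by depth-first search."""
    prefix = prefix + (start,)
    if start == end:
        return [list(prefix)]
    found = []
    for nxt in successors[start]:
        if nxt not in prefix:  # skip back edges so loops terminate
            found.extend(all_paths(successors, nxt, end, prefix))
    return found

blocks, successors = load_cfg(CFGML)
paths = all_paths(successors, "1", "6")
```

For this instance the two paths correspond to the if statement on line 4 being taken or skipped. Nothing in the sketch depends on the program having been written in C.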
[Figure 7 diagram, from bottom to top: Source Code (Java, C++, ...; C, Pascal, Fortran, ...); AST Level Representations (JavaML, CppML, ... under OOML; CML, PascalML, FortranML, ... under ProcML); Higher Level Representations (FactML, with CFGML, PDGML/SDGML, CGML, ... on top of it); Generic Analysis Tools (Data Flow Analysis, Program Slicing, Architectural Recovery, ...). The diagram also shows External Tools and labels the layers as The Program Representation Framework.]

Figure 7: System Architecture for the Program Representation Framework

4.3 PDGML and CGML

Similarly, PDGs and Call Graphs can be modeled in UML and the corresponding XML DTDs can be generated from them.

5. The Representation Framework

In Figure 7 we present the multi-layered framework for language neutral representation of program artifacts. We also demonstrate the usage of the framework for building generic program analysis tools. The framework follows a pipe and filter architectural style. The pipe components are the different layers of abstraction of the program source and the filter components are the representation transformers and the analysis tools.

5.1 Abstraction Layers

There are three distinct layers corresponding to three different levels of abstraction of source code in the framework. Layer 0 is the original source text of the program to be analyzed, as it is.

Layer 1 is the first level of abstraction of the source code in terms of the AST of the program. We choose to adopt the AST representations proposed by Zou and Mamas to fit in this layer. Since these representations also include the generic representations for the procedural and object-oriented language families, they will provide language neutral representations of the AST. This layer consists of two sub-layers. Layer 1.1 contains the AST representations in programming language specific markup languages, i.e. JavaML, CppML, CML, PascalML and FortranML. Layer 1.2 contains the AST representations derived from the generic model of the language family, i.e. ProcML and OOML.

Layer 2 is the next level of abstraction in terms of the different intra-procedural and inter-procedural graphs. This layer also consists of two sub-layers. Layer 2.1 represents the basic facts of a program in the FactML format. Layer 2.2 contains the representations for the intra-procedural and inter-procedural dependence and flow graphs of the program expressed as CFGML, PDGML, SDGML and CGML.

5.2 Transformers

A set of transformer tools is required to convert the representations from one level to the next higher level of abstraction. Some of them are source code transformers, that is, parsers of the source text that emit the corresponding AST in the language specific XML format. There has to be one such transformer for each of the languages to be analyzed.

The rest of the transformers are XML to XML transformers. These transformers can be built using XSLT stylesheets [25], XPath/XQuery [26] or DOM [28] manipulation. There will be one transformer for each of the following conversions:

- JavaML, CppML to OOML
- CML, PascalML, FortranML to ProcML
- OOML, ProcML to FactML
- FactML to CFGML, PDGML, CGML
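
The conversions listed above form a small directed graph of formats, so the chain of transformers needed for a given analysis can be found by a graph search. The sketch below is only an illustration of that pipe and filter idea; the table is taken from the list above, while the function name transformer_chain is an assumption.

```python
from collections import deque

# Conversions from Section 5.2: source format -> formats it is transformed to.
CONVERSIONS = {
    "JavaML": ["OOML"],
    "CppML": ["OOML"],
    "CML": ["ProcML"],
    "PascalML": ["ProcML"],
    "FortranML": ["ProcML"],
    "OOML": ["FactML"],
    "ProcML": ["FactML"],
    "FactML": ["CFGML", "PDGML", "CGML"],
}

def transformer_chain(source, target):
    """Breadth-first search for the shortest chain of formats from source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        chain = queue.popleft()
        if chain[-1] == target:
            return chain
        for nxt in CONVERSIONS.get(chain[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(chain + [nxt])
    return None  # no sequence of transformers produces the target

java_to_cfg = transformer_chain("JavaML", "CFGML")
c_to_callgraph = transformer_chain("CML", "CGML")
```

Both chains pass through FactML, which reflects its role as the common layer beneath the graph representations.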
5.3 Analysis Tools 6.2 Operational Statistics
Various program analysis tools can be written on In this section we evaluate the proposed framework
top of the proposed framework. Since these tools will in terms of the sizes and the time required to generate
work on language neutral representations of the the representations by the prototype toolset. Five input
program, it is possible to develop a single tool to perform a particular type of analysis on a source program written in any programming language. For example, a generic data flow analysis tool can be written to work on the CFGML, or a single slicing tool can be written to use the PDGML to perform program slicing on source code of any language.
All the representations in the proposed framework are XML and hence can easily be transformed to other formats using XSLT or XQuery in order to export data to an external tool. If the external tool also uses an XML representation for its data, then it is straightforward to import the data using the same techniques. However, if the external tool does not use XML representations, additional mapping tools are needed to map the external formats to the internal XML representations.

6. A Prototype Implementation
We have developed a prototype toolset based on the proposed framework. Our prototype works on the JavaML-OOML representation of Mamas and uses the Ret4J [29] toolkit to generate JavaML-OOML instances of Java programs. A minor modification was made to Ret4J to include a lineNumber attribute in the generated XML.

6.1 Analysis Tools
As part of the toolset we have developed a fact extractor that takes an OOML file as input and generates a FactML file. The tool works on the DOM tree of the OOML instance and uses XPath queries to extract information from it. We have also developed a PDG generator that works on both the OOML and FactML files and generates a PDGML instance.
The toolset also includes a PDG slicer that slices a PDGML instance and emits a reduced PDG based on the algorithm given in [4]. The statements remaining in the sliced PDG comprise the program slice. The slicer can perform the following kinds of slicing:
Backward slicing for a given program point and a variable use, and for the final use of a given variable
Backward decomposition slicing for the uses of a given variable
Forward slicing for a given program point and a variable use
Forward decomposition slicing for the definition of a given variable

Files of different sizes were used to measure the size and time parameters. These files were chosen from a variety of sources ranging from student course projects to standard utility libraries. The prototype was developed using the Java programming language (JDK 1.3) and all the experiments were run on a Sun UltraSPARC III 440 MHz station with 512 MB of RAM running the Solaris 8 operating system.
Table 2 presents the size of the generated FactML files and the time required by the fact extractor tool to generate them. The size of the FactML is approximately 5 times the size of the source code. Table 3 summarizes the relationship between the size of a method, the size of its corresponding PDGML, and the time taken to produce it. Even though the size of the PDGML generally increases with the size of the method, this is not always the case. When the program contains few def-use chains, the number of edges in the graph is low, which results in a smaller PDGML. Finally, Table 4 shows the results of slicing based on the final uses of a given variable. The size of the slice compared to the size of the method shows the same property as the size of the PDG. The time required to slice a PDG is quite reasonable and depends on the size of the source.

Table 2: Experimental Results for Fact Extraction
Source Program        Source Size (bytes)   FactML Size (bytes)   FactML Time (ms)
MyMath.Java           187                   2,030                 224
Voter.java            3,822                 17,502                1,487
GUI.java              4,994                 20,697                1,498
UnboundedLife.java    10,831                33,800                3,849
PDG.java              22,200                81,072                11,403

Table 3: Experimental Results for PDG Creation
Class:Method             Size (LOC)   PDGML Size (bytes)   PDGML Time (ms)
MyMath:factorial         13           2,466                154
UnboundedLife:restore    30           6,194                402
GUI.java                 39           11,144               748
PDG:backwardSlice        40           12,166               711
Voter:fix                55           11,086               945
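
As a rough illustration of the backward slicing performed by the PDG slicer, the sketch below loads a small dependence graph from XML and keeps only the nodes from which the slicing criterion can be reached along dependence edges. The element and attribute names (`pdg`, `node`, `edge`, `id`, `line`, `from`, `to`, `kind`) and the sample method are assumptions made for this example and are not the actual PDGML schema. The traversal is plain reverse reachability over one method's graph, which is a simplification and not the full algorithm of [4].

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, deque

# Hypothetical PDG-like XML (NOT the real PDGML schema).
# An edge from A to B means statement B depends on statement A.
SAMPLE_PDG = """
<pdg method="factorial">
  <node id="n1" line="1" code="int result = 1;"/>
  <node id="n2" line="2" code="int i = 1;"/>
  <node id="n3" line="3" code="int count = 0;"/>
  <node id="n4" line="4" code="while (i &lt;= n)"/>
  <node id="n5" line="5" code="result = result * i;"/>
  <node id="n6" line="6" code="count = count + 1;"/>
  <node id="n7" line="7" code="i = i + 1;"/>
  <node id="n8" line="8" code="print(count);"/>
  <node id="n9" line="9" code="return result;"/>
  <edge from="n1" to="n5" kind="data"/>
  <edge from="n1" to="n9" kind="data"/>
  <edge from="n2" to="n4" kind="data"/>
  <edge from="n2" to="n5" kind="data"/>
  <edge from="n2" to="n7" kind="data"/>
  <edge from="n3" to="n6" kind="data"/>
  <edge from="n3" to="n8" kind="data"/>
  <edge from="n5" to="n9" kind="data"/>
  <edge from="n6" to="n8" kind="data"/>
  <edge from="n7" to="n4" kind="data"/>
  <edge from="n7" to="n5" kind="data"/>
  <edge from="n4" to="n5" kind="control"/>
  <edge from="n4" to="n6" kind="control"/>
  <edge from="n4" to="n7" kind="control"/>
</pdg>
"""


def backward_slice(pdg_xml, criterion_id):
    """Return the sorted source lines of all nodes the criterion depends on."""
    root = ET.fromstring(pdg_xml.strip())

    # Map each node id to its source line number.
    line_of = {n.get("id"): int(n.get("line")) for n in root.iter("node")}

    # Index edges by target so the graph can be walked backwards.
    predecessors = defaultdict(list)
    for edge in root.iter("edge"):
        predecessors[edge.get("to")].append(edge.get("from"))

    # Breadth-first traversal against the direction of the dependence edges.
    in_slice = {criterion_id}
    queue = deque([criterion_id])
    while queue:
        current = queue.popleft()
        for pred in predecessors[current]:
            if pred not in in_slice:
                in_slice.add(pred)
                queue.append(pred)

    return sorted(line_of[node_id] for node_id in in_slice)


if __name__ == "__main__":
    # Slice on the returned value: the statements on count drop out.
    print(backward_slice(SAMPLE_PDG, "n9"))  # [1, 2, 4, 5, 7, 9]
```

A forward slice under the same assumptions would index the edges by their `from` attribute and follow them in the forward direction from the chosen definition.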
Table 4: Experimental Results for Slicing
Class:Method:Variable      Source Size (LOC)   Slice Size (LOC)   Slicing Time (ms)
MyMath:factorial:i         13                  10                 2
UnboundedLife:restore:x    30                  13                 3
GUI.java:labels            39                  24                 10
PDG:backwardSlice:list     40                  31                 26
Voter:fix:game             55                  23                 10

7. Conclusion
In this paper we presented a framework for language neutral program representation. The framework is based on a multi-layered abstraction of source code artifacts represented using several XML applications. The framework adopts the existing XML applications for source code representation and defines new applications to represent higher-level program abstractions. The framework is extensible: new representations and tools can be added to it to facilitate different generic analysis tasks.
The operational statistics obtained from the prototype toolset show that the tools operate fairly accurately and with reasonable performance. The sizes of the different intermediate representations and the time required to generate them are reasonable. In conclusion, this paper provides the fundamental mechanism to build generic tools that perform program analysis independently of the programming language used to write the program.

8. References
[1] Alfred V. Aho and Jeffrey D. Ullman. Principles of Compiler Design. Addison-Wesley Publishing Company. April 1979.
[2] Francis E. Allen. Control flow analysis. ACM SIGPLAN Notices, Volume 5 Issue 7. July 1970.
[3] Jeanne Ferrante, Karl J. Ottenstein and Joe D. Warren. The Program Dependence Graph and Its Use in Optimization. ACM Transactions on Programming Languages and Systems. July 1987.
[4] Susan Horwitz, Thomas Reps and David Binkley. Interprocedural Slicing Using Dependence Graphs. ACM TOPLAS, Volume 12 No 1. January 1990.
[5] D. Callahan, A. Carle, M. W. Hall, K. Kennedy. Constructing the Procedure Call Multigraph. IEEE Transactions on Software Engineering, Volume 16 Issue 4. April 1990.
[6] G. C. Murphy, D. Notkin and E. S. Lan. An empirical study of static call graph extractors. Proceedings of the 18th International Conference on Software Engineering. March 1996.
[7] Hrvoje Simic and Marko Topolnik. Prospects of Encoding Java Source Code in XML. Conference of Telecommunications, 2003.
[8] Greg J. Badros. JavaML: A Markup Language for Java Source Code. International World Wide Web Conference, 2000.
[9] Michael L. Collard, Huzefa H. Kagdi and Jonathan I. Maletic. An XML-based Lightweight C++ Fact Extractor. International Workshop on Program Comprehension, 2003.
[10] Gregory McArthur, John Mylopoulos and Siu Ng. An Extensible Tool for Source Code Representation Using XML. Working Conference on Reverse Engineering, 2002.
[11] James R. Cordy. Generalized Selective XML Markup of Source Code Using Agile Parsing. International Workshop on Program Comprehension, 2003.
[12] Evan Mamas and Kostas Kontogiannis. Towards Portable Source Code Representations using XML. Working Conference on Reverse Engineering, 2000.
[13] Evan Mamas. Design and Implementation of Integrated Software Maintenance Environment. MASc Thesis, Department of Electrical and Computer Engineering, University of Waterloo. 2000.
[14] Ying Zou and Kostas Kontogiannis. A Framework for Migrating Procedural Code to Object Oriented Platforms. Asia Pacific Software Engineering Conference, 2001.
[15] Ying Zou and Kostas Kontogiannis. Incremental Transformation of Procedural Systems to Object Oriented Platforms. Computer Software and Applications Conference, 2003.
[16] Ying Zou. Techniques and Methodologies for the Migration of Legacy Systems to Object Oriented Platforms. PhD Thesis, Department of Electrical and Computer Engineering, University of Waterloo. 2003.
[17] Ric Holt, Andy Schürr, Susan Elliott Sim and Andreas Winter. Graph Exchange Language. http://www.gupro.de/GXL/
[18] R. Holt, A. Winter and A. Schürr. GXL: Towards a Standard Exchange Format. Working Conference on Reverse Engineering, 2000.
[19] James R. Cordy, C. D. Halpern and E. Promislow. TXL: A Rapid Prototyping System for Programming Language Dialects. Computer Languages, January 1991.
[20] XML.ORG, www.xml.org
[21] World Wide Web Consortium, www.w3c.org
[22] XML Schema, www.w3.org/XML/Schema
[23] MathML, www.w3.org/Math
[24] VoiceXML, www.w3.org/TR/voicexml20
[25] The Extensible Stylesheet Language Family, www.w3.org/Style/XSL
[26] XML Query, www.w3.org/XML/Query
[27] XML Linking, www.w3.org/XML/Linking
[28] Document Object Model, www.w3.org/DOM
[29] Reengineering Toolkit for Java, www.alphaworks.ibm.com/tech/ret4j
