
A Metric Space Based Software Clone Detection Approach

Zhuo LI
Computer Science College Zhejiang University
Hangzhou, China
[email protected]
Jianling SUN
Computer Science College Zhejiang University
Hangzhou, China
[email protected]


Abstract—A metric space is a set together with a definition of distance between the elements of that set. This paper introduces the metric space into code clone detection and uses the distance within a metric space to measure the similarity of code. It proposes a process of building up a metric space to detect clones in a software system. Based on a metric space derived from software metrics, clone detection gains more convenience and flexibility. We also exercise the approach on a real industrial project.

Keywords-metric space; clone detection; software metric; proximity query
I. INTRODUCTION
Software systems contain a large number of clones. The copy-paste-modify programming approach, a common practice in software development, can easily lead to widely distributed code clones. This behavior tends to be informal, hard to control and poorly documented, which raises a series of software maintenance problems, including but not limited to the following:
- Program defects spread wider and faster. This becomes worse if the programmer is not aware of the similarities.
- Program logic consistency across similar code slices is fragile.
- Program size increases unnecessarily, which raises the effort required to understand the system.
As stated in [1], around 5% to 10% of the code in a software system is cloned, and organizations spend more than 60% of their effort maintaining existing software [2]. This ratio keeps growing, which also makes maintenance more expensive. There is an increasing need for an efficient and reliable approach to detecting code clones.
In this paper, we propose a metric space based software clone detection approach. It transforms software code slices into metric space members with coordinate values and then measures the similarity across all members based on the distance within the same metric space. The closer two members are, the more similar they are from the code perspective. The rest of this paper is organized as follows. Section 2 introduces existing clone detection mechanisms. Section 3 introduces the proposed approach: its mathematical background (metric space and metric space distance calculation), its workflow and a detailed explanation of each step. Section 4 gives a case study verifying its feasibility on a real industrial software system. Section 5 wraps up the findings from the case study and discusses future work.
II. RELATED WORKS
Research on code clone detection can be traced back to the 1990s [3] [4] [5]. It has been an active area since then and is tightly connected with code refactoring, software re-engineering, software maintenance, etc. Surveys [6][7] present the main software clone detection approaches, which can be divided into five types.
The first type is text-based. It considers code slices as sequences of strings and compares them with each other to find identical strings [6]. Little or no source code normalization is needed. The returned result set is accurate given its text-based nature, but it is incomplete because it cannot detect structural or functional clones.
The second type is token-based. It uses a lexer and/or parser to transform source code into a sequence of tokens. This sequence is then scanned to find identical token sequences. The original code slices represented by those token sequences are returned as clones. This approach is more robust against code clones with noisy blank spaces or comments, but its accuracy is less satisfactory because the normalization and token conversion process brings quite a few false positive clones into the result set. [8] explores string algorithms to find suitable data structures and algorithms for efficient token based clone detection.
The third type is abstract syntax tree (AST) based. It parses source code into an AST instance and then traverses it for identical subtrees. The original code slices represented by those subtrees are returned as clone suspects. The AST contains the source code's hierarchy information, so the result set's accuracy is good, but its scalability is not stable and depends on the algorithm used to build and compare ASTs at run time. [9] discusses an automatic refactoring method for detected code clones based on the AST and static analysis.
The fourth type is program dependency graph (PDG) based. A PDG reflects a program's control flow and data dependencies. Once a set of PDGs is obtained from the source code, an isomorphic subgraph matching algorithm [10] is applied to find identical subgraphs, and the original code slices represented by those subgraphs are returned as clones. PDG-based approaches are robust in detecting functional similarities such as reordered statements, intertwined code, non-contiguous code, etc., because they analyze both the syntactic and semantic information of the program. But they can
hardly be applied to large-scale systems, considering the cost of building the PDG model.
The last type is metric-based, which is also the basis of our proposed approach. It gathers different metrics from the source code and uses these metrics to measure code clones. In most cases, the source code is parsed into its AST/PDG representation for metric calculation [11] [12], and the distance inside the metric space is used to detect code clones. This approach is more scalable than AST or PDG based approaches because metric comparison is straightforward compared with AST/PDG based clone detection.
Table I wraps up the portability, accuracy, integrality and scalability of these five approaches.
TABLE I. CLONE DETECTION APPROACHES SUMMARY

Name   | Portability                | Accuracy | Integrality | Scalability
Text   | High                       | High     | Low         | Relative to comparison algorithm
Token  | Medium                     | Low      | High        | High
AST    | Low                        | High     | Low         | Relative to comparison algorithm
PDG    | Low                        | High     | Medium      | Low
Metric | Relative to defined metric | High     | Medium      | High

III. METRIC SPACE BASED CLONE DETECTION
The following subsections give background knowledge about metric spaces, metric space distance and how to conduct a proximity query based on a given metric space distance formula, and then describe the details of the proposed clone detection process.
A. Metric Space and Metric Space Distance
A metric space is a pair $(X, d)$, where $X$ stands for a set of valid objects and $d(x, y)$ stands for a function calculating the distance between $x$ and $y$, two members of $X$. $d(x, y)$ must have the following properties so that it can be used to measure metric space distance:
- (p1) $\forall x, y \in X,\ d(x, y) \ge 0$ (positiveness);
- (p2) $\forall x, y \in X,\ d(x, y) = d(y, x)$ (symmetry);
- (p3) $\forall x \in X,\ d(x, x) = 0$ (reflexivity);
- (p4) $\forall x, y, z \in X,\ d(x, y) \le d(x, z) + d(z, y)$ (triangle inequality).
Given an $n$-dimensional metric space, its members are identified by $n$ real-valued coordinates $(x_1, \ldots, x_n)$. For members $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$, there are many different ways to calculate the metric distance between them, depending on the norm number. Table II below wraps up the most common formulas.
TABLE II. METRIC DISTANCE FORMULAS

p-norm distance | Formula
1-norm          | $d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
2-norm          | $d_2(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^2 \right)^{1/2}$
p-norm          | $d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
Infinite norm   | $d_\infty(x, y) = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max_{1 \le i \le n} |x_i - y_i|$
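To make the formulas in Table II concrete, the following minimal sketch computes the 1-norm, 2-norm and infinity-norm distances between two coordinate vectors. It is an illustrative implementation written for this explanation, not part of the original tooling.

```python
from math import inf

def p_norm_distance(x, y, p=2):
    """p-norm (Minkowski) distance between two equal-length coordinate vectors.

    p=1 gives the 1-norm, p=2 the Euclidean (2-norm) distance, and
    p=float('inf') the infinity norm (largest coordinate difference).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Example: distances between two 3-dimensional metric space members.
x, y = (1.0, 8.0, 7.0), (2.0, 5.0, 7.0)
print(p_norm_distance(x, y, 1))    # 1-norm: 4.0
print(p_norm_distance(x, y, 2))    # 2-norm: ~3.162
print(p_norm_distance(x, y, inf))  # infinity norm: 3.0
```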

B. Proximity Query
Once a metric space is defined, we can conduct a proximity query to find members that are close to each other. There are two types of proximity query, range query [13] and nearest neighbor query [14], defined below and illustrated in the sketch that follows.
- Range query $Q(q, r)$: given a benchmark member $q$ and a query radius $r$, it returns every member $u$ of the metric space $(X, d)$ such that $d(q, u) \le r$.
- Nearest neighbor query $NN(q)$: given a benchmark member $q$, it returns the member $u$ of the metric space $(X, d)$ such that all other members of the same metric space have an equal or longer distance to $q$.
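The sketch below shows both query types over a toy set of coordinate members, using Python's built-in Euclidean distance; it is a simplified illustration written for this summary, not the authors' implementation.

```python
from math import dist  # Euclidean (2-norm) distance between coordinate tuples

def range_query(members, q, r):
    """Range query Q(q, r): members whose distance to the benchmark q is at most r."""
    return [u for u in members if dist(q, u) <= r]

def nearest_neighbor(members, q):
    """Nearest neighbor query NN(q): the member closest to q, excluding q itself."""
    return min((u for u in members if u != q), key=lambda u: dist(q, u))

# Usage on a toy 2-dimensional metric space.
members = [(0.0, 1.0), (2.0, 2.0), (5.0, 1.0)]
print(range_query(members, (0.0, 0.0), 2.5))   # [(0.0, 1.0)]
print(nearest_neighbor(members, (2.0, 2.0)))   # (0.0, 1.0)
```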
C. Proposed Detection Process
Figure 1 shows the process of the detection approach proposed in this paper. It can be divided into three fundamental steps: metric definition, metric space building and proximity query. Besides these three fundamental steps, there is an additional cleaning step, result set verification.
[Figure 1 is a flowchart of the metric space based clone detection process: Step 1, Metric Definition (1.1 define metric selection dimensions; 1.2 define metric selection standard; expert review of the defined metric elements). Step 2, Metric Space Building (2.0.1 metric composition where needed; 2.1 generate raw metric space M0; 2.2.1 select a metric space member O as the new origin; 2.2.2 build a new metric space M'; 2.3 add M' to a list L; repeat while more members remain). Step 3, Proximity Query (3.1 select a new metric space M' from L; 3.2 conduct an NN proximity query in M'; 3.3 get the result pair P(O, Q), where Q is the nearest member to O in M'; 3.4 add P(O, Q) to the result set list RS; repeat while more metric spaces remain). Step 4, Result Set Verification (4.1 verification check), after which processing ends.]

Figure 1. Metric Space Based Clone Detection Process Diagram
1) Metric Definition
This step defines metrics, either from existing software metrics such as KLOC or specifically from the target software's nature. It is the preliminary step of the whole process, but it is also the most critical one, because the quality of the defined metrics determines the ultimate result set's accuracy. Although many comprehensive metric categories have been defined and proven in previous research such as [15], the selected metrics must fulfill the following characteristics:
- Accurate: the metrics must present the software system's complexity both syntactically and semantically;
- Simple: the metrics must be clear and straightforward for people to understand;
- Practical: the metrics should present the software efficiently and be easy to verify;
- Easy to get: the metrics should be factors or features of the software system that can easily be retrieved or calculated manually or automatically.
2) Metric Space Building
Once the metrics are defined in step 1, their concrete values can be retrieved from the target software. However, the metric values cannot always be used directly to build the metric space, because different metrics might carry different weights in presenting software complexity. In order to build a more exact metric space, a formalization step might be required, as shown in step 2.0.1 in Figure 1. A detailed example of this formalization is given in the case study section below.
Assume we have a set $\{P_1, P_2, \ldots, P_n\}$ to be measured and $m$ metrics defined after formalization; each member $P_n$ can then be presented as a vector $P_n = (p_1, p_2, \ldots, p_m)$. If a conceptual origin member $P_o = (0, 0, \ldots, 0)$, whose metric values are all 0, is assigned, then with a pre-defined distance formula $d(P_o, P_n)$ we obtain a raw metric space $(P, d)$. The 2-norm distance formula, also known as the Euclidean distance and widely used in many domains, is selected in this process.
The raw metric space gives an absolute presentation of each member's metric values. By setting each member as the new origin point and using the same Euclidean distance formula, a new metric space $M'$ can be built. It is used in a later step to sort out the potential clone pairs, as the sketch below illustrates.
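A minimal sketch of steps 2.1 to 2.3, assuming each code slice has already been reduced to a tuple of formalized metric values; the member names and the dictionary-of-distances representation are illustrative choices made for this example, not the authors' tooling.

```python
from math import dist  # Euclidean (2-norm) distance

def raw_metric_space(members):
    """Step 2.1: distance of every member to the conceptual all-zero origin."""
    dimension = len(next(iter(members.values())))
    origin = (0.0,) * dimension
    return {name: dist(origin, vector) for name, vector in members.items()}

def per_member_spaces(members):
    """Steps 2.2.1-2.3: for each member O taken as the new origin, record the
    distance from O to every other member, and collect these spaces in a list."""
    spaces = []
    for origin_name, origin_vector in members.items():
        space = {name: dist(origin_vector, vector)
                 for name, vector in members.items() if name != origin_name}
        spaces.append((origin_name, space))
    return spaces

# Toy example: three members described by three formalized metrics each.
members = {"P1": (3.0, 1.0, 2.0), "P2": (3.0, 1.5, 2.0), "P3": (9.0, 4.0, 7.0)}
print(raw_metric_space(members))   # distances to the (0, 0, 0) origin
print(per_member_spaces(members))  # one distance map per member-as-origin
```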
3) Proximity Query
From step 2 we obtain n metric spaces, one with each member of the original set as its origin point. A proximity query is conducted on all these metric spaces, and the result set of each query stands for the potential code clones of that origin point member. The nearest neighbor query is selected in this approach because its result set is compact and easier to analyze in our case study work.
4) Result Set Verification
This step is mandatory because no mechanism can ensure 100% accuracy. It can be an automatic and/or manual process. An automatic check can be done first, based on the four properties of the metric distance: positiveness, symmetry, reflexivity and triangle inequality (see the sketch below). Manual verification usually means an expert review that relies on the reviewer's domain knowledge and experience to filter fake elements out of the result set.
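As an illustration of what such an automatic check could look like, the sketch below verifies the four metric properties over a set of member vectors; it is our hedged reading of the description above, not the authors' verification tool. For the plain Euclidean distance these properties hold by construction, so a check like this mainly guards against errors introduced during metric collection, composition, or by a substituted custom distance.

```python
from itertools import combinations
from math import dist, isclose

def check_metric_properties(members, tol=1e-9):
    """Verify positiveness, symmetry, reflexivity and the triangle inequality
    of the distance function over a set of coordinate vectors."""
    for x in members:
        assert isclose(dist(x, x), 0.0, abs_tol=tol)              # reflexivity
    for x, y in combinations(members, 2):
        assert dist(x, y) >= 0.0                                  # positiveness
        assert isclose(dist(x, y), dist(y, x), abs_tol=tol)       # symmetry
    for x, y, z in combinations(members, 3):
        assert dist(x, y) <= dist(x, z) + dist(z, y) + tol        # triangle inequality
    return True

print(check_metric_properties([(3.0, 1.0), (3.0, 1.5), (9.0, 4.0)]))  # True
```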
IV. A CASE STUDY
We applied our approach at a service provider that offers financial services to institutional investors.
The project we selected is a reporting project that develops and delivers hundreds of accounting reports to its clients. It maintains more than 900 report templates. About 300 of them are core reports, and another 600-plus are custom reports derived from the core reports with certain customizations in both data and layout.
With the quick growth of the business, the number and complexity of these reports keep expanding. Accordingly, more and more code clones exist, which makes this project a good candidate for a pilot. Eight core report templates and 15 custom templates were randomly selected for this experiment.
A. Metric Definition
Because these report templates are developed with an industrial software package rather than a direct programming language, common code metrics can hardly be applied here, considering the effort of converting and parsing the templates. The metrics used in this experiment mainly follow the recommendations of the developers who maintain these reports.
The metrics fall into two categories: structural and data model. Structural metrics indicate the layout and hierarchy complexity of a report template. Data model metrics indicate the raw data and business logic complexity of a report template. KLOC, as a common code metric, is also adopted due to its easy-to-get nature.
TABLE III. METRIC DEFINITION IN CASE STUDY

# | Name | Description | Metric Category
1 | # of Report Sections | A report section is a structural component that builds the logical structure of the report. | Structural
2 | # of Report Groupings | Report grouping stands for how many levels of grouping are applied to the incoming data set. | Data
3 | # of Conditional Sections | A conditional section is a structural component that conditionally instantiates one of several components in a section. | Structural
4 | # of Sequential Sections | A sequential section is a structural component that produces multiple reports within a single report object. | Structural
5 | # of Frames | A frame is a container for visual components such as controls, charts, and other nested frames. | Structural
6 | # of Data Types in the Incoming Data Row & # of Data Columns under Each Type | Indicates how many types of data are contained in the raw data and how many data fields exist under each data type. | Data
7 | # of Control Types & # of Controls under Each Type | A control is the primary visual component in a report. | Structural
B. Metric Space Building
1) Metric Composition
A preliminary step in metric space building is to collect the raw values of the defined metrics. However, as stated before, not all metrics can be used directly to build the metric space. Metrics #6 and #7 in the sample above, for instance, have discrete values that do not reflect the intended measurement by themselves and need to be composited before the metric space is built.
Take #7 as an example. After the raw values are retrieved, one report template has the data table below:
TABLE IV. COMPOSITE METRIC SAMPLE

# of controls and control types: 5
Date | Text | Double | Label | Currency
1    | 8    | 7      | 1     | 6
The composition process can be treated as a separate metric space building process. The discrete control types are treated as metrics of a target metric space, and the composited new metric value can then be calculated using the Euclidean distance formula. The purpose is to use the metric space distance within this small metric space as the composite metric value when building the macro metric space in the step below. In our example, the new metric value is

$x = \left( \sum_{i=1}^{5} |x_i - 0|^2 \right)^{1/2} = \sqrt{1^2 + 8^2 + 7^2 + 1^2 + 6^2} \approx 12.288$
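A quick check of this composite value, using the same Euclidean formula against an all-zero origin (a one-off illustration, with the control counts taken from Table IV):

```python
from math import dist

control_counts = (1, 8, 7, 1, 6)             # Date, Text, Double, Label, Currency
composite = dist((0,) * len(control_counts), control_counts)
print(round(composite, 3))                   # 12.288
```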

2) Generate Raw Metric Space
Based on the collected/composited raw metric values and the pre-defined distance formula, a raw metric space can be built. All templates' metric values are calculated against the same conceptual origin point $P_o = (0, 0, \ldots, 0)$, whose coordinate values are all 0.
3) Build New Metric Spaces
Based on the raw metric space built in the step above, we continue building new metric spaces by setting each report template as the new origin point in turn.
4) Proximity Query and Verification
After a nearest neighbor query on each metric space built in step 3) above, we get one report pair that stands for the report most similar to the specific report set as the origin point. Table V summarizes all clone pairs derived from each metric space's nearest neighbor query. The report on the left side of each pair is the one used as the origin point in its metric space.
TABLE V. DETECTED CLONES RESULT SET

Similar Template Pair  | Template Category | Vector Space Distance | Expert Review Comments
Report 10 - Report 12  | Core - Core       | 0.14696125            | Valid
Report 12 - Report 10  | Core - Core       | 0.14696125            | Valid
Report 1 - Report 2    | Custom - Custom   | 0.1539989             | Valid
Report 2 - Report 1    | Custom - Custom   | 0.1539989             | Valid
Report 15 - Report 14  | Custom - Custom   | 0.325339              | Valid
Report 14 - Report 15  | Custom - Custom   | 0.325339              | Valid
Report 9 - Report 10   | Core - Core       | 0.43826673            | Valid
Report 19 - Report 20  | Custom - Custom   | 0.469898              | Valid
Report 20 - Report 19  | Custom - Custom   | 0.469898              | Valid
Report 8 - Report 4    | Custom - Custom   | 1.000968              | Valid
Report 4 - Report 8    | Custom - Custom   | 1.000968              | Valid
Report 5 - Report 8    | Custom - Custom   | 1.024528              | Invalid
Report 16 - Report 15  | Custom - Custom   | 1.228755              | Valid
Report 11 - Report 9   | Core - Core       | 1.4288057             | Valid
Report 7 - Report 3    | Core - Core       | 1.588681              | Valid
Report 3 - Report 7    | Core - Core       | 1.588681              | Valid
Report 13 - Report 7   | Custom - Core     | 1.71408               | Invalid
Report 21 - Report 13  | Custom - Custom   | 2.314027              | Valid
Report 17 - Report 21  | Core - Custom     | 2.591144              | Valid
Report 22 - Report 18  | Custom - Core     | 2.776276              | Invalid
Report 18 - Report 22  | Core - Custom     | 2.776276              | Valid
Report 6 - Report 5    | Custom - Custom   | 3.06126               | Valid
Report 23 - Report 22  | Custom - Custom   | 3.908171              | Valid
Table V is sorted by distance from small to large. The closer the distance, the more easily the pair can be merged, as the two reports are more similar. We conducted an expert review of the result set as the verification process, and the result is indicated in the last column of the table above. For the pairs whose distance is smaller than 0.5, and likewise for those smaller than 1, the validity of the result set (number of valid pairs / number of total pairs) is 100%, and the overall validity of the result set is 87%. This shows that our approach is accurate. A few more things are worth mentioning here:
- All the mutually proven pairs, such as report 1 and report 2, passed the expert review;
- "Valid" means the detected template is the most similar to the origin template across all the templates queried, not that the pair is similar enough for a code merge.
V. SUMMARY AND FUTURE WORK
The metric space approach is an advanced version of metric based software clone detection. It inherits advantages such as accuracy and scalability from software metrics. Furthermore, it provides more convenience in converting structured or semi-structured software sources into metrics, as well as more flexibility through various mathematical manipulations of the metric space, such as proximity queries. It has proven its accuracy and efficiency in clone detection through the experiment in this paper. However, our experiment covers only a preliminary stage of this approach, and there are a few action items we need to explore further in the future:
- Metric definition: the definition process is experience based rather than standard based. Experience can make the metric system specifically accurate for a certain software system, but it can hardly scale up to wider scenarios with various technologies;
- Metric space building: the metric value retrieval process is semi-manual and can hardly scale up when the detection scope increases;
- Verification check: the current process relies mostly on expert review, which is inefficient. It would be very helpful if the verification check could be driven by pre-defined rules deduced from the defined metrics.
REFERENCES
[1] Cory Kapser and Michael W. Godfrey, "Toward a Taxonomy of Clones in Source Code: A Case Study," Proceedings of the 1st International Workshop on Evolution of Large Scale Industrial Software Applications (ELISA), 2003.
[2] M. Hanna, "Maintenance Burden Begging for a Remedy," Software Magazine, April 1993.
[3] Brenda S. Baker, "On Finding Duplication and Near-Duplication in Large Software Systems," Second Working Conference on Reverse Engineering, Los Alamitos, California, IEEE Computer Society Press, (1995) 86-95.
[4] Brenda S. Baker, "Parameterized Diff," Proceedings of the 10th ACM-SIAM Symposium on Discrete Algorithms (SODA'99), Baltimore, Maryland, USA, (1999) 854-855.
[5] Brenda S. Baker, "Parameterized Pattern Matching: Algorithms and Applications," Journal of Computer and System Sciences, Vol. 52(1):28-42, February 1996.
[6] R. Koschke, "Survey of Research on Software Clones," Dagstuhl Seminar 06301, 24 pp., 2006.
[7] C.K. Roy and J.R. Cordy, "A Survey on Software Clone Detection Research," School of Computing TR 2007-541, Queen's University, 115 pp., 2007.
[8] Hamid Abdul Basit, Simon J. Puglisi, William F. Smyth, Andrew Turpin, and Stan Jarzabek, "Efficient Token Based Clone Detection with Flexible Tokenization," Proceedings of ESEC/FSE'07, the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Dubrovnik, Croatia, (2007) 513-516.
[9] Yu Dong-qi, Peng Xin, and Zhao Wen-yun, "An Automatic Refactoring Method of Cloned Code Using Abstract Syntax Tree and Static Analysis," Journal of Chinese Computer Systems, Vol. 28, (2008) 1752-1760.
[10] J. Krinke, "Identifying Similar Code with Program Dependence Graphs," Working Conference on Reverse Engineering, (2001) 301-309.
[11] Jaspar Cahill, James M. Hogan, and Richard Thomas, "The Java Metrics Reporter - An Extensible Tool for OO Software Analysis," Ninth Asia-Pacific Software Engineering Conference (APSEC'02), pp. 507, 2002.
[12] Frank Simon, Frank Steinbrückner, and Claus Lewerentz, "Metrics Based Refactoring," Fifth European Conference on Software Maintenance and Reengineering (CSMR), pp. 30, 2001.
[13] R. Baeza-Yates, W. Cunto, U. Manber, and S. Wu, "Proximity Matching Using Fixed-Queries Trees," Proceedings of the Fifth Combinatorial Pattern Matching (CPM'94), Lecture Notes in Computer Science, vol. 807, (1994) 198-212.
[14] P. Yianilos, "Excluded Middle Vantage Point Forests for Nearest Neighbor Search," DIMACS Implementation Challenge, ALENEX'99, Baltimore, MD, 1999.
[15] Michalis Xenos, D. Stavrinoudis, K. Zikouli, and D. Christodoulakis, "Object-Oriented Metrics - A Survey," Proceedings of FESMA 2000, Federation of European Software Measurement Associations.
