
Singapore Management University
Institutional Knowledge at Singapore Management University

Research Collection School Of Information Systems

5-2008

Scalable detection of semantic clones

Mark GABEL, University of California, Davis
Lingxiao JIANG, Singapore Management University, [email protected]
Zhendong SU, University of California, Davis

Follow this and additional works at: http://ink.library.smu.edu.sg/sis_research
Part of the Software Engineering Commons

Citation
GABEL, Mark; JIANG, Lingxiao; and SU, Zhendong. Scalable detection of semantic clones. (2008). ICSE '08: ACM/IEEE 30th International Conference on Software Engineering: 10-18 May 2008, Leipzig, Germany. 321-330. Research Collection School Of Information Systems.
Available at: http://ink.library.smu.edu.sg/sis_research/934

Scalable Detection of Semantic Clones*

Mark Gabel    Lingxiao Jiang    Zhendong Su
Department of Computer Science
University of California, Davis
{mggabel,lxjiang,su}@ucdavis.edu

ABSTRACT

Several techniques have been developed for identifying similar code fragments in programs. These similar fragments, referred to as code clones, can be used to identify redundant code, locate bugs, or gain insight into program design. Existing scalable approaches to clone detection are limited to finding program fragments that are similar only in their contiguous syntax. Other, semantics-based approaches are more resilient to differences in syntax, such as reordered statements, related statements interleaved with other unrelated statements, or the use of semantically equivalent control structures. However, none of these techniques has scaled to real world code bases. These approaches capture semantic information from Program Dependence Graphs (PDGs), program representations that encode data and control dependencies between statements and predicates. Our definition of a code clone is also based on this representation: we consider program fragments with isomorphic PDGs to be clones.

In this paper, we present the first scalable clone detection algorithm based on this definition of semantic clones. Our insight is the reduction of the difficult graph similarity problem to a simpler tree similarity problem by mapping carefully selected PDG subgraphs to their related structured syntax. We efficiently solve the tree similarity problem to create a scalable analysis. We have implemented this algorithm in a practical tool and performed evaluations on several million-line open source projects, including the Linux kernel. Compared with previous approaches, our tool locates significantly more clones, which are often more semantically interesting than simple copied and pasted code fragments.

Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement—restructuring, reverse engineering, and reengineering

General Terms
Languages, Algorithms, Experimentation

Keywords
Clone detection, program dependence graph, software maintenance, refactoring

* This research was supported in part by NSF CAREER Grant No. 0546844, NSF CyberTrust Grant No. 0627749, NSF CCF Grant No. 0702622, the US Air Force under grant FA9550-07-1-0532, and a generous gift from Intel. The information presented here does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICSE'08, May 10-18, 2008, Leipzig, Germany.
Copyright 2008 ACM 978-1-60558-079-1/08/05 ...$5.00.

1. INTRODUCTION

Considerable research has been dedicated to methods for the detection of similar code fragments in programs. Once located, these fragments, or clones, can be used in many ways. Clones have been used to gain insight into program design, to identify redundant code to use as candidates for refactoring, and to be analyzed for consistent usage for the purpose of bug detection.

DECKARD [9], CP-Miner [14], CCFinder [10], and CloneDR [3] represent the most mature clone detection techniques. These tools share several common characteristics. Each tool locates syntactic clones, and each has been shown to scale to millions of lines of code. Under empirical evaluation, each tool has been shown to locate comparable numbers of clones.

By operating on token streams and syntax trees, these techniques locate clones that are resilient to minor code modifications, such as the changing of types or constant values. This resilience gives these tools some modicum of semantic awareness: two program fragments may differ in their concrete syntax, but the normalizing effects of the respective clone tools allow the detection of their semantic similarity.

The sets of clones located by each of these tools are fundamentally limited by the working definition of a code clone. Each tool is capable of finding clones solely within a program's contiguous, structured syntax. Certain interesting clones can elude detection: these tools are sensitive to even the simplest structural differences in otherwise semantically similar code. These structural differences can include reordered statements, related statements interleaved with other unrelated statements, or the use of semantically equivalent control structures.

As a motivating example, consider the code snippet in Figure 1. When compared with the listing in Figure 2, the code is similar: both perform the same overall computation, but the latter snippet contains extra statements to time the loop. Current scalable clone detection techniques are unable to detect these interleaved clones.

While detecting true semantic similarity is undecidable in general, some clone detection techniques have attempted to locate clones with a less strict, semantics-preserving definition of similar code. Rather than scanning token sequences or similar subtrees, these techniques have operated on program dependence graphs [6], or PDGs.
A PDG is a representation of a procedure in which the nodes represent simple statements and control flow predicates, and edges encode data and control dependencies. PDG-based similarity detection tools have all used some variant of subgraph isomorphism to detect either similar procedures or code fragments [12, 16]. These computations are particularly expensive, and none of these techniques has been shown to scale to even moderately sized code bases.

In this paper, we introduce an extended definition of code clones, based on PDG similarity, that captures more semantic information than previous approaches. We then provide a scalable, approximate algorithm for detecting these clones. We reduce the difficult graph similarity problem to a simpler tree similarity problem by creating a mapping between PDG subgraphs and their related structured syntax. Specifically, we make the following technical contributions:

1. We extend the definition of a code clone to include semantically (but not necessarily syntactically) related code fragments. Our definition is a generalization of previous syntactic clone definitions, and it thus defines a superset of previously defined clones.

2. We introduce an approximate algorithm for detecting these clones that scales to millions of lines of code. Our algorithm is based on a reduction of deliberately selected PDG subgraphs to abstract syntax tree forests. We then utilize an existing, tree-based detection technique [9] to locate clones.

3. We implement a practical tool based on our algorithm and perform an extensive empirical evaluation. Our tool is capable of scanning large, real-world C and C++ projects. Compared with previous approaches, our tool locates significantly more clones, which are often more semantically interesting than simple copied and pasted code fragments.

The rest of this paper is structured as follows. We begin with a discussion of background information on components of our analysis (Section 2). The body of our work continues with the presentation of our definitions and algorithm (Section 3). We then discuss our implementation (Section 4) and present the results of our empirical evaluation (Section 5). Finally, we discuss related work (Section 6) and conclude with ideas for future work (Section 7).

     1  int func(int i, int j) {
     2      int k = 10;
     4      while (i < k) {
     5          i++;
     6      }
     8      j = 2 * k;
    10      printf("i=%d, j=%d\n", i, j);
    11      return k;
    12  }

Figure 1: Example code listing.

     1  int func_timed(int i, int j) {
     2      int k = 10;
     4      long start = get_time_millis();
     5      long finish;
     7      while (i < k) {
     8          i++;
     9      }
    11      finish = get_time_millis();
    12      printf("loop took %dms\n", finish - start);
    14      j = 2 * k;
    16      printf("i=%d, j=%d\n", i, j);
    17      return k;
    18  }

Figure 2: A similar example code listing.

Figure 3: The PDG for Figure 1. (Diagram: statement nodes for the formal parameters int i and int j, the declaration int k, the expressions k = 10, i++, and j = 2 * k, the control point i < k, the call site printf() with its actual parameters i, j, and "i=%d, j=%d", and the return of k, connected by data dependency and control dependency edges; implicit entry, body, and exit nodes appear in a lighter shade.)

2. BACKGROUND

Our algorithm augments an existing clone detection technique, DECKARD [9], with semantic information derived from program dependence graphs (PDGs). This section provides the necessary background on both program dependence graphs and DECKARD's vector-based clone detection.

2.1 Program Dependence Graphs

A program dependence graph [6] (PDG) is a static representation of the flow of data through a procedure. It is commonly used to implement program slicing [20]. The nodes of a PDG consist of program points constructed from the source code: declarations, simple statements, expressions, and control points. A control point represents a point at which a program branches, loops, or enters or exits a procedure and is labeled by its associated predicate.

A PDG models the flow of data through a procedure. In effect, the PDG abstracts away many arbitrary syntactic decisions a programmer made while constructing a function. For example, any possible arbitrary interleaving of unrelated statements within a procedure yields precisely the same PDG.

The edges of a PDG encode the data and control dependencies between program points. Given two program points p1 and p2, there exists a directed data dependency edge from p1 to p2 if and only if the execution of p2 depends on data calculated directly by p1. For example, consider the statements on lines 2 and 8 of the listing in Figure 1. The second statement calculates a value that is initialized in the first. This dependency is illustrated by a directed edge between the two nodes in Figure 3.

Note that the node corresponding to the formal parameter j does not have any outgoing edges. This accurately reflects the fact that j is redefined without ever being used at line 8 in the listing.

The incrementing of i on line 5 also presents an interesting case. Because an increment constitutes both a use and a definition, the node in the PDG corresponding to i++ has both a self data dependency loop and outgoing data dependency edges.
Figure 4: The AST for Figure 1 with characteristic vectors. (Diagram: the abstract syntax tree of func, rooted at a block node with children for the initialized declaration, the while statement, the assignment expression statement, the printf call statement, and the return statement; a subtree characteristic vector appears below each significant node, and merged vectors produced by a sliding window of size three appear beneath the tree. Key: atomic node, child node, subtree vector, merged vector.)


Similarly, there exists a directed control dependency edge from p1 to p2 if and only if the choice to execute p2 depends on the test in p1. The while loop on line 4 of the listing illustrates the use of control dependency edges flowing from a control point node. The corresponding PDG node in Figure 3 is labeled with the guard expression, and there is a control dependency edge to the enclosed increment statement. Note that this node also has a control dependency self loop. This is indicative of a looping structure: if we replace while with if, we need only remove this one self-looping control edge to yield the new PDG.

Callsites are modeled as control points that control the execution of expressions corresponding to the calculation of the actual parameters and the assignment of the return value (or assignments to out parameters). The call to printf on line 10 is modeled in this way: there are three outgoing control edges that connect to the three parameters.

PDGs may also contain implicit nodes that do not have a direct source correspondence. These include entry, exit, and function body control points, and are represented by a light shade in Figure 3. These nodes are used to connect PDGs and form a larger, interprocedural structure that is sometimes referred to as a system dependence graph (SDG) [8]. Because our work focuses on intraprocedural dependencies only, we simplify our graphs by omitting these nodes.
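To make the preceding description concrete, the following sketch models a stripped-down PDG in Java and builds a few of the nodes and edges discussed above for Figure 1. It is our own illustration, not part of the paper's tooling (the actual PDGs are produced by CodeSurfer, as described in Section 4), and the class, field, and label names are invented for the example.

    import java.util.ArrayList;
    import java.util.List;

    // A minimal PDG model: nodes are program points, edges are labeled dependencies.
    final class PdgNode {
        enum Kind { DECLARATION, STATEMENT, EXPRESSION, CONTROL_POINT, CALL_SITE }

        final Kind kind;
        final String label;                                   // e.g., "int k = 10" or the guard "i < k"
        final List<PdgNode> dataSuccs = new ArrayList<>();    // outgoing data dependency edges
        final List<PdgNode> ctrlSuccs = new ArrayList<>();    // outgoing control dependency edges

        PdgNode(Kind kind, String label) { this.kind = kind; this.label = label; }

        void addDataDep(PdgNode to) { dataSuccs.add(to); }    // "to" uses a value computed here
        void addCtrlDep(PdgNode to) { ctrlSuccs.add(to); }    // "to" executes only if this test allows it
    }

    final class PdgExample {
        public static void main(String[] args) {
            // A fragment of the PDG for Figure 1.
            PdgNode declK = new PdgNode(PdgNode.Kind.DECLARATION, "int k = 10");
            PdgNode guard = new PdgNode(PdgNode.Kind.CONTROL_POINT, "i < k");
            PdgNode incr  = new PdgNode(PdgNode.Kind.EXPRESSION, "i++");
            PdgNode asgnJ = new PdgNode(PdgNode.Kind.EXPRESSION, "j = 2 * k");

            declK.addDataDep(guard);   // k flows into the loop test
            declK.addDataDep(asgnJ);   // k flows into the assignment on line 8
            guard.addCtrlDep(incr);    // the increment runs only while the test holds
            guard.addCtrlDep(guard);   // control self-loop: the loop re-tests itself
            incr.addDataDep(incr);     // an increment is both a use and a definition
            incr.addDataDep(guard);    // the new value of i feeds the next test

            System.out.println("'" + declK.label + "' has " + declK.dataSuccs.size()
                    + " outgoing data dependency edges");
        }
    }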
2.2 Scalable Tree-based Clone Detection

The foundation of our analysis is the DECKARD clone detection tool. DECKARD implements a tree-similarity based technique that uses the idea of characteristic vectors to efficiently match clone pairs. This section provides a brief review of this technique. A more detailed discussion can be found in the full paper [9].

A characteristic vector is a numerical approximation of a particular subtree. The dimension of the vectors is uniform and is determined by the total number of possible types of q-level atomic patterns deemed relevant to approximate a given tree, where a q-level atomic pattern is a complete binary tree of height q with tree node labels for identification. In our context, these node labels correspond to terminals and non-terminals in the grammar for a language. The maximum number of q-level atomic patterns is L^(2^q − 1) if the number of possible labels (including the empty label) is L. Each characteristic vector is a point ⟨c1, ..., cn⟩ in n-dimensional Euclidean space, where n is the number of distinct q-level atomic patterns. Each ci counts the number of occurrences of the atomic pattern represented by index i. One important property of characteristic vectors is that given two subtrees, T1 and T2, and their respective q-level vectors, v1 and v2, if the edit distance between the trees is k, then the Euclidean distance between the vectors is no more than (4q − 3)k [21]. This characterization allows the tree-based clone detection problem to be reformulated as a nearest neighbor problem on numerical vectors.

For the application of tree-based clone detection, we use 1-level atomic patterns. The actual domain of atomic patterns is given by the various node types defined in the grammar for a language that the user deems "significant." For parse trees, insignificant nodes might include semicolons and brackets, and significant nodes might include expressions, operators, and statements.

DECKARD focuses on parse trees, but we have adapted the algorithm to function on abstract syntax trees. The algorithm is essentially the same, but a few changes were necessary. This will be discussed in Section 4.

DECKARD first generates vectors that effectively cover an entire tree. This is done in two distinct phases. The first phase traverses the tree in postorder and generates vectors for each "significant" subtree, where significance is a heuristic setting that marks a node type as being a suitable parent node. In the original implementation, "significant" nodes included statements, expressions, and declarations. In our current, AST-based implementation, we have the luxury of more semantic information from the AST class hierarchy. We define "significant" nodes to be those that descend from the parent "statement" class.

Figure 4 depicts a simplified abstract syntax tree for the code listing in Figure 1. The subtree vectors appear below each significant node. For each node, DECKARD first creates a vector that consists of the sum of the node's children's vectors. Then, it increments the value in the current node's index position if the current node is significant.

The second phase consists of moving a sliding window along adjacent subtrees and merging the subtree vectors. This allows groups of contiguous statements or expressions to be grouped into a single vector for matching. Configuration options prevent merges that are not likely to be useful, like the merging of the tail end of a block with the head of an adjacent block. Figure 4 contains merged vectors that were generated with a sliding window size of three.
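The first, post-order phase can be written down in a few lines. The sketch below is our own paraphrase of that phase, not DECKARD's code: each node's vector is the sum of its children's vectors plus a count for the node's own kind, and the vectors of significant subtrees are collected as clone candidates. A single significant flag stands in here for both the notion of counted node kinds and the notion of suitable parent nodes, which the real tool distinguishes.

    import java.util.ArrayList;
    import java.util.List;

    // A toy AST node with one characteristic-vector slot per node kind.
    final class AstNode {
        final int kindIndex;            // index of this node's kind in the vector
        final boolean significant;      // in our setting: descends from the "statement" class
        final List<AstNode> children = new ArrayList<>();

        AstNode(int kindIndex, boolean significant) {
            this.kindIndex = kindIndex;
            this.significant = significant;
        }
    }

    // Post-order characteristic vector generation: a node's vector is the sum of its
    // children's vectors, plus one occurrence of its own kind when that kind counts.
    final class SubtreeVectorPass {
        private final int numKinds;
        private final List<int[]> candidates = new ArrayList<>();

        SubtreeVectorPass(int numKinds) { this.numKinds = numKinds; }

        int[] generate(AstNode node) {
            int[] v = new int[numKinds];
            for (AstNode child : node.children) {
                int[] cv = generate(child);
                for (int i = 0; i < numKinds; i++) v[i] += cv[i];
            }
            if (node.significant) {
                v[node.kindIndex]++;        // count this node's own kind
                candidates.add(v.clone());  // emit the subtree vector as a clone candidate
            }
            return v;
        }

        List<int[]> candidates() { return candidates; }
    }

The second phase would then slide a window over adjacent significant subtrees and emit the sums of their vectors, as described above.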
At this point, the tree has been reduced to a set of points in Euclidean space. To efficiently cluster large numbers of vectors, DECKARD uses Locality Sensitive Hashing [7], an efficient approximate near-neighbor solver. When combined with a lossless partitioning of the vectors based on their size [9], the LSH engine is capable of enumerating clone groups from millions of vectors in a few minutes.

Using these approximations, DECKARD is able to enumerate a comprehensive set of clone groups over millions of lines of code in tens of minutes. In practice, the use of vector approximations and an approximate nearest neighbor solver does not affect the quality of the results; the false positive rate is extremely low.
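As a rough illustration of how such vectors can be grouped without pairwise comparison, the sketch below buckets vectors by a p-stable (Gaussian projection) locality-sensitive hash. This is only a schematic stand-in: it is a different LSH family from the construction in [7], it uses a single hash table rather than several, and it omits the size-based partitioning and post-processing filters of the real tool.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    // Groups characteristic vectors by an LSH signature: vectors that are close in
    // Euclidean distance tend to fall into the same bucket.
    final class LshBuckets {
        private final double[][] projections;   // k random Gaussian directions
        private final double[] offsets;
        private final double bucketWidth;

        LshBuckets(int dim, int k, double bucketWidth, long seed) {
            Random rnd = new Random(seed);
            this.projections = new double[k][dim];
            this.offsets = new double[k];
            this.bucketWidth = bucketWidth;
            for (int i = 0; i < k; i++) {
                for (int j = 0; j < dim; j++) projections[i][j] = rnd.nextGaussian();
                offsets[i] = rnd.nextDouble() * bucketWidth;
            }
        }

        private String signature(int[] v) {
            StringBuilder sig = new StringBuilder();
            for (int i = 0; i < projections.length; i++) {
                double dot = offsets[i];
                for (int j = 0; j < v.length; j++) dot += projections[i][j] * v[j];
                sig.append((long) Math.floor(dot / bucketWidth)).append(':');
            }
            return sig.toString();
        }

        // Candidate clone groups: all vectors sharing a signature.
        Map<String, List<int[]>> group(List<int[]> vectors) {
            Map<String, List<int[]>> buckets = new HashMap<>();
            for (int[] v : vectors) {
                buckets.computeIfAbsent(signature(v), s -> new ArrayList<>()).add(v);
            }
            return buckets;
        }
    }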
3. ALGORITHM DESCRIPTION

The vectors generated during the first phase of DECKARD's execution provide a high degree of coverage of the syntactic structure of a given program. Our approach involves augmenting DECKARD's vector generation phase with a third pass: the generation of vectors for semantic clones. We then use the same LSH-based clustering technique to solve the near-neighbor problem and generate clone reports.

3.1 Definitions

Considering once again the motivating example in Figure 2, we notice that the computation of data is identical to that of the code in Figure 1. However, purely syntactic definitions of code clones do not capture this relationship. Syntactic definitions of code clones are defined similarly:

Definition 3.1 (Syntactic Code Clone) Two disjoint, contiguous sequences of program syntax S1 and S2 are code clones if and only if δ(S1, S2).

In this general definition, the precise form of the syntax is not described, and δ refers to a similarity function. CP-Miner uses a distance metric on token streams called gap, CloneDR uses a size-sensitive definition on trees referred to as Similarity, and DECKARD uses tree edit distance.

We expand this definition to include non-contiguous but related code. Assume we have a mapping function, ρ, that maps a sequence of syntax (of arbitrary type) to a PDG subgraph.

Definition 3.2 (Semantic Code Clone) Two disjoint, possibly non-contiguous sequences of program syntax S1 and S2 are semantic code clones if and only if S1 and S2 are syntactic code clones or ρ(S1) is isomorphic to ρ(S2).

Applying this relaxed definition to our example allows us to consider a subset of the code in Figure 2 as a candidate for clone detection. In this instance, we disregard the timing code. The remaining subset is a syntactic clone with the body of Figure 1, and their associated PDGs are identical.

3.2 High-Level Algorithm

There are difficulties with locating these semantic clones in a scalable manner. First, there is a combinatorial explosion of possible clones. Second, although graph isomorphism testing may be feasible for small, simple PDGs, it is computationally expensive in general. Any method that would require pairwise comparisons would not scale.

Thus far, no scalable algorithm exists for detecting semantic clones. We present a scalable, approximate technique for locating semantic clones based on the fact that both structured syntax trees and dependence graphs are derived from the original source code. Because of this relationship, we are able to construct a mapping function that locates the associated syntax for a given PDG subgraph. We refer to this associated syntax as the syntactic image. For compatibility with DECKARD's tree-based clone detection, we map to AST forests.

Definition 3.3 (Syntactic Image) The syntactic image of a PDG subgraph G, μ(G), is the maximal set of AST subtrees that correspond to the concrete syntax of the nodes in G. The set is a dominating set, i.e., for all pairs of trees T, T′ ∈ μ(G), T ⊄ T′.

For example, consider the code snippet in Figure 1. Suppose we have the PDG subgraph corresponding to all lines that reference the variable i, i.e., the graph consisting of the nodes (from Figure 3) "int i," "i < k," "i++," and the actual parameter to the printf call.

We map each of these nodes to their structured syntax. The actual parameter maps to the subtree corresponding to the call of printf, which is the "function-call" subtree in Figure 4 (we only consider subtrees that can be syntactically separated). The incrementing of i corresponds to the "ipost-incr" subtree, and the control point (i < k) corresponds to the "while" subtree. Because the "ipost-incr" subtree is subsumed by the "while" subtree, we include only the latter in the syntactic image.

Mapping a PDG subgraph to an AST forest yields something that we can match very efficiently, both partially and fully, using DECKARD's vector generation. This relationship to syntax effectively reduces the difficult graph similarity problem to an easier tree similarity problem that we can solve efficiently with DECKARD.
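One way to compute the syntactic image of Definition 3.3 is to map every node of the subgraph to its smallest syntactically separable enclosing subtree and then discard any subtree contained in another, leaving a forest of pairwise-incomparable trees. The Java sketch below assumes that such a mapping and a parent pointer are available on the AST; the interfaces and names are ours, not the tool's.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    final class SyntacticImage {
        // Assumed interface onto the AST: each tree knows its parent.
        interface Tree {
            Tree parent();
        }

        // Assumed interface onto the PDG: each node maps to the smallest enclosing
        // syntactically separable subtree (statement, declaration, ...).
        interface GraphNode {
            Tree enclosingSubtree();
        }

        private static boolean isAncestor(Tree maybeAncestor, Tree t) {
            for (Tree cur = t.parent(); cur != null; cur = cur.parent()) {
                if (cur == maybeAncestor) return true;
            }
            return false;
        }

        // mu(G): the maximal set of AST subtrees for the nodes of a PDG subgraph.
        // A subtree subsumed by another subtree in the image is dropped, so the
        // result is a forest of pairwise-incomparable trees.
        static List<Tree> image(Collection<? extends GraphNode> subgraph) {
            Set<Tree> trees = new LinkedHashSet<>();
            for (GraphNode n : subgraph) trees.add(n.enclosingSubtree());
            List<Tree> forest = new ArrayList<>();
            for (Tree t : trees) {
                boolean subsumed = false;
                for (Tree other : trees) {
                    if (other != t && isAncestor(other, t)) { subsumed = true; break; }
                }
                if (!subsumed) forest.add(t);
            }
            return forest;
        }
    }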
The overall architecture is shown in Figure 5. At a high level, our algorithm functions as follows:

1. We run DECKARD's primary vector generation. Subtree and sliding window vectors efficiently provide contiguous syntactic clone candidates for the entire program.

2. For each procedure, we enumerate a finite set of significant subgraphs; that is, we enumerate subgraphs that hold semantic relevance and are likely to be good semantic clone candidates. These algorithms are discussed in Section 3.3. In short, we produce subgraphs of maximal size that are likely to represent distinct computations.

3. For each subgraph G, we compute μ(G) to generate an AST forest.

4. We use DECKARD's sliding window vector merging to generate a complete set of characteristic vectors for each AST forest.

5. We use LSH to quickly solve the near-neighbor problem and enumerate the clone groups. As before, we apply a set of post-processing filters to remove spurious clone groups and clone group members.

If a semantic vector is a member of a clone group, then the flow of data represented by its syntactic members is duplicated by the other members of the clone group—each of which can come from any phase of vector generation. That is, a given semantic vector can match either:

• a complete AST subtree,
• a sequence of contiguous statements, or
• another semantic vector: a slice of another procedure.

3.3 PDG Subgraph Selection

Our algorithm generates vectors over a finite set of subgraphs of a given procedure's PDG. This section details our definitions of interesting subgraph sets and our algorithms for enumerating them.

3.3.1 Weakly Connected Components

Consider two statements, s1 and s2, that are contained in a single given procedure with PDG P. As discussed in Section 2.1, s1 and s2 have an implied relationship (a data or control dependency) if there exists a path between them. However, the absence of a path does not imply that the statements are completely unrelated: the two statements could both influence a third, common statement.
Figure 5: Semantic clone detection algorithm. (Diagram: the procedure's PDG is partitioned into PDG subgraphs; the syntactic image mapping relates these subgraphs to the AST; DECKARD's vector generation produces characteristic vectors from both, and LSH clustering groups the vectors into clones.)

Consider our earlier example in Figure 1. Statements 5 (increment) and 8 (simple assignment) do not have a connecting path in the PDG, but they are still tied together by their common use at the call site on line 10.

Note, though, that these indirect relationships are characterized by undirected paths in the PDG. We say that two statements are unrelated if there does not exist any undirected path between them in the PDG, i.e., each resides in separate weakly connected components.

A natural but conservative choice for a set of interesting PDG subgraphs is to cluster the graph into weakly connected components. Statements that are separated are definitely unrelated, and statements that are clustered have a semantic relationship. These subgraphs are especially interesting if their respective sets of statements are interleaved when viewed in the context of the concrete syntax of the procedure.

Practically, this can be implemented with a series of linear-time graph searches. As previously noted, we operate over our simplified PDGs: implicit entry and exit nodes are omitted. Without breaking the implicit control dependence of function entry and exit, every procedure would have exactly one weakly connected component.
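As a sketch of those linear-time searches, the following Java method labels each node of the simplified PDG with a component id by depth-first search over an undirected view of the dependency edges. The graph interface is an assumption made for the example; any adjacency representation would do.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    final class WeaklyConnectedComponents {
        // Assumed adjacency view of the simplified PDG: neighbors over data and
        // control edges in either direction (entry/exit nodes already omitted).
        interface Graph<N> {
            Iterable<N> nodes();
            List<N> undirectedNeighbors(N node);
        }

        // Assigns each node a component id; nodes with different ids are unrelated.
        static <N> Map<N, Integer> compute(Graph<N> pdg) {
            Map<N, Integer> component = new HashMap<>();
            int nextId = 0;
            for (N start : pdg.nodes()) {
                if (component.containsKey(start)) continue;
                int id = nextId++;
                Deque<N> stack = new ArrayDeque<>();
                stack.push(start);
                component.put(start, id);
                while (!stack.isEmpty()) {
                    N cur = stack.pop();
                    for (N next : pdg.undirectedNeighbors(cur)) {
                        if (!component.containsKey(next)) {
                            component.put(next, id);
                            stack.push(next);
                        }
                    }
                }
            }
            return component;
        }
    }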
3.3.2 Semantic Threads

Although analyzing weakly connected components for clones does yield interesting results, many semantic clones may not be detected this way. Thus, for better coverage, we need a more fine-grained partitioning of the statements in a procedure body. First, to illustrate the problem, let us consider the following hypothetical example in Figure 6.

    struct file_stat *compute_statistics() {
        struct file_stat *result = malloc(sizeof(struct file_stat));
        int avg_temp_file_size = 0;
        int avg_data_file_size = 0;
        /* iterate the temp files */
        ...
        /* iterate the data files */
        ...
        /* avg results and store in avg_temp_file_size */
        ...
        /* avg results and store in avg_data_file_size */
        ...
        result->temp_size = avg_temp_file_size;
        result->data_size = avg_data_file_size;
        return result;
    }

Figure 6: Semantic thread example.

It is clear that there are two distinct flows of data throughout the function, which only merge at the end. However, the aggregation of the two calculations (through "return result") causes the entire function body to be grouped as a single component. The PDG subgraphs corresponding to these computations overlap, but only at the end when the results are returned.

One way of modeling this particular flow of data is through the concept of a forward slice [8], which is a specific variant of program slicing [20]. A forward slice from a program point s with respect to a variable v consists of all program points that may be directly or indirectly affected by the execution of s with the value of v. In our work, we adopt a simplified definition, which assumes that all slices from point s are computed with respect to all variables used or defined in s.

This type of static intraprocedural slicing is straightforward when given a dependence graph. Forward slices are defined by graph connectivity in the PDG: each forward edge encodes a specific data or control dependency.

Definition 3.4 (Forward Slice) Let G be the PDG of procedure P, and let s be a statement in P. The static intraprocedural forward slice from s over P, f(s), is defined as the set of all nodes reachable from s in G.

In the above example, two forward slices from the declarations at the second and third lines yield two distinct but interleaving threads of computation that intersect at one statement.

We wish to enumerate these potentially overlapping flows of data for each procedure, which we refer to as semantic threads.

Definition 3.5 (Semantic Thread) A semantic thread of a procedure P is either a forward slice f(s) or the union of one or more forward slices.

Simply enumerating the set of forward slices of a procedure yields redundant data. For example, some slices may fully subsume others. It is clear from our definition that

    ∀ s1, s2 ∈ P. s1 ∈ f(s2) ⇒ f(s1) ⊆ f(s2)

In addition, although some overlapping among different slices should be allowed (and even desired), a significant amount of overlapping may imply that these slices might be part of the same higher-level computation. This is especially evident in forward slices from consecutive, related declarations:

    int count_list_nodes(struct list_node *head) {
        int i = 0;
        struct list_node *tail = head->prev;

        while (head != tail && i < MAX) {
            i++;
            head = head->next;
        }

        return i;
    }

The forward slices from the declarations on the first and second lines differ only in their respective first nodes. Considering these to be separate computations would be a mistake: we not only create redundant data, but we also fail to recognize the larger semantic thread and may miss important clones.
We define a set, IST(P, γ), that consists of our set of interesting γ-overlapping semantic threads. These subgraphs represent our candidates for possible semantic clones.

Definition 3.6 (Interesting Semantic Threads) The set of interesting γ-overlapping semantic threads is a finite set of semantic threads with the following properties:

1. The set is complete; its union represents the entire PDG.

2. The set must not contain any fully subsumed threads:

    ∄ sl, sl′ ∈ IST(P, γ). sl ⊆ sl′

3. Any two threads in the set share at most γ nodes:

    ∀ sl, sl′ ∈ IST(P, γ). |sl ∩ sl′| ≤ γ

4. IST(P, γ) is maximal, i.e., it has the maximal size of all sets that meet properties 1-3.

With γ set to one, the first code example has two semantic threads. Note that setting γ to zero is precisely equivalent to computing weakly connected components.

Algorithm 1 Construct Semantic Threads
 1: function BST(P : PDG, γ : int) : STs
 2:   IST, seen ← ∅
 3:   Sort nodes in P in ascending order w.r.t. their locations in the source code.
 4:   for all nodes n in P do
 5:     if n ∉ seen then
 6:       slice ← DepthFirstSearch(n, P)
 7:       seen ← seen ∪ slice
 8:       IST ← AddSlice(IST, slice, γ)
 9:     end if
10:   end for
11:   return IST
12: end function
13:
14: function AddSlice(IST : STs, slice : ST, γ : int) : STs
15:   conflicts ← ∅
16:   for all threads T in IST do
17:     if |slice ∩ T| > γ then
18:       conflicts ← conflicts ∪ {T}
19:     end if
20:   end for
21:   if conflicts = ∅ then
22:     return IST ∪ {slice}
23:   else
24:     slice ← slice ∪ ⋃T∈conflicts T
25:     return AddSlice(IST \ conflicts, slice, γ)
26:   end if
27: end function
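Read as code, the construction is a forward-reachability search per unseen node followed by a greedy merge of any threads that overlap the new slice in more than γ nodes. The Java sketch below mirrors Algorithm 1 over simplified types of our own choosing (a thread is just a set of nodes, and forward successors are supplied by the PDG); it is illustrative rather than the tool's implementation.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class SemanticThreads {
        interface Pdg<N> {
            List<N> nodesInSourceOrder();      // ascending by source location
            List<N> forwardSuccessors(N node); // data and control dependency successors
        }

        // Forward slice f(s): all nodes reachable from s.
        private static <N> Set<N> forwardSlice(Pdg<N> pdg, N s) {
            Set<N> slice = new HashSet<>();
            Deque<N> work = new ArrayDeque<>();
            work.push(s);
            slice.add(s);
            while (!work.isEmpty()) {
                for (N succ : pdg.forwardSuccessors(work.pop())) {
                    if (slice.add(succ)) work.push(succ);
                }
            }
            return slice;
        }

        // BST(P, gamma): build the interesting gamma-overlapping semantic threads.
        static <N> List<Set<N>> build(Pdg<N> pdg, int gamma) {
            List<Set<N>> threads = new ArrayList<>();
            Set<N> seen = new HashSet<>();
            for (N n : pdg.nodesInSourceOrder()) {
                if (seen.contains(n)) continue;
                Set<N> slice = forwardSlice(pdg, n);
                seen.addAll(slice);
                addSlice(threads, slice, gamma);
            }
            return threads;
        }

        // AddSlice: merge the new slice with every thread it overlaps in more than
        // gamma nodes, and retry until no conflicts remain.
        private static <N> void addSlice(List<Set<N>> threads, Set<N> slice, int gamma) {
            List<Set<N>> conflicts = new ArrayList<>();
            for (Set<N> t : threads) {
                if (overlap(slice, t) > gamma) conflicts.add(t);
            }
            if (conflicts.isEmpty()) {
                threads.add(slice);
                return;
            }
            Set<N> merged = new HashSet<>(slice);
            for (Set<N> t : conflicts) merged.addAll(t);
            threads.removeAll(conflicts);
            addSlice(threads, merged, gamma);
        }

        private static <N> int overlap(Set<N> a, Set<N> b) {
            int count = 0;
            for (N x : a) if (b.contains(x)) count++;
            return count;
        }
    }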
Algorithm 1 is a simple greedy algorithm for computing this set. The function AddSlice ensures that the final set contains no threads that overlap by more than γ nodes, and the enumeration of each node in the PDG implies the completeness of the returned set. The following argues that Algorithm 1 produces a maximal set.

Lemma 3.7 If AddSlice combines two slices, then they must be combined in any set that satisfies the definition of interesting semantic threads.

Proof. Consider a procedure P and two statements, s1 and s2. Assume that |f(s1) ∩ f(s2)| > γ. Their individual presence in the final set would clearly violate property 3.

Let IST(P, γ) be an arbitrary set that meets our requirements. Because every node must be included in at least one semantic thread and the domain of semantic threads consists of non-empty unions of forward slices, it follows that:

    |f(s1) ∩ f(s2)| > γ ⇒ ∃ T ∈ IST(P, γ). f(s1) ∪ f(s2) ⊆ T

Similarly, if two slices must be combined, then any thread that conflicts with this combined thread must also be combined. AddSlice implements this process of recursive greedy combination exactly.

Theorem 3.8 (Maximality) The set of interesting semantic threads returned by BST is maximal.

Proof. Define BST(P, γ) to be the set returned by Algorithm 1. Assume that there exists another set, IST′, that meets all requirements and is strictly larger than this set:

    BST(P, γ) = {T1, ..., Tn}
    IST′(P, γ) = {T′1, ..., T′n, ..., T′m}

Because each set contains no fully subsumed threads, it follows that there exists at least one "head node," h1, ..., hn and h′1, ..., h′m, for each semantic thread. By the pigeonhole principle,

    ∃ i, j, k. h′i ∈ Tk ∧ h′j ∈ Tk

h′i and h′j are associated with unique forward slices that do not fully subsume each other. By Lemma 3.7, because AddSlice combined these two slices (f(h′i) ∪ f(h′j) ⊆ Tk), they must be combined in every set that meets our requirements. Because these slices are separated in IST′, IST′ does not meet our requirements—a contradiction.

In the worst case, the algorithm's execution time is cubic in the number of nodes of a given procedure's PDG. In practice, this is not a problem. The size of a given PDG is usually small, in the tens of nodes, and the number of non-subsumed forward slices is considerably less.

The problem also scales gracefully in the sense that procedure sizes are generally bounded: larger code bases have more procedures, not necessarily larger procedures. Finally, our empirical results show that the execution time of this algorithm is inconsequential (Section 5).

3.3.3 Empirical Study

This section contains a brief evaluation of the occurrence of weakly connected components and semantic threads in real systems. We evaluated five open source projects: The GIMP, GTK+, MySQL, PostgreSQL, and the Linux kernel (these same projects are analyzed for clones in Section 5). Figure 7 contains the numbers of weakly connected components per project.

                  Procedures    Procedures with n WCCs
                                    1      2      3     4    5+
    GIMP              13337     7255   3498   1255   627   702
    GTK               13284     8773   2967    763   348   433
    MySQL             14408     5419   6134   1450   616   789
    PostgreSQL         9276     4105   3290   1033   335   513
    Linux            136480    60533  52771  13273  5094  4809

Figure 7: Number of weakly connected components.

We noted that each project contains a significant number of procedures with more than one weakly connected component. This suggests that there are functions in real systems that do in fact perform separate computations. Not all of these computations are necessarily interleaved; they could be sequential, and may not represent new targets for clone detection.

Figure 8 counts the number of procedures that contain non-trivial weakly connected components (γ = 0) and γ = 3 semantic threads.
                  Procedures    Procs w/ non-triv.    Procs w/ non-triv.
                                       γ = 0 STs             γ = 3 STs
    GIMP              13337                  903                  3008
    GTK               13284                  697                  2380
    MySQL             14408                 1618                  2441
    PostgreSQL         9276                 1221                  2267
    Linux            136480                10609                 22514

Figure 8: Number of procedures with non-trivial semantic threads.

Non-trivial semantic threads include interleaved sequences of related code that cannot be detected by current scalable clone detection techniques.

Overall, we are able to consider a significant number of new clone candidates. In addition, the concept of γ-overlapping semantic threads allows us to extend our search to a far greater number of potential clone candidates.

4. IMPLEMENTATION

This section describes the implementation of our tool. It consists of three primary components: AST and PDG generation, vector generation, and LSH clustering. To generate syntax trees and dependence graphs, we use Grammatech's CodeSurfer (http://www.grammatech.com), which allows us to analyze both C and C++ code bases. We output this data to a proprietary format using a Scheme script that utilizes Grammatech's API.

This raw data is read and used by a Java implementation of DECKARD's vector generation engine. This component also performs the syntactic image mapping and semantic vector generation. The LSH clustering back-end of DECKARD is used without modification.

4.1 Implementation Details

DECKARD's vector generation engine previously operated over parse trees. We reimplemented the algorithm to generate vectors for abstract syntax trees. In the process, we made several core improvements.

In order to efficiently utilize our dual core machines, we made the vector generation phase parallel using Java's concurrency API. At present, we use a procedure as a single unit of work. The tasks are inherently independent: generating the vectors for a procedure does not require any data outside of the procedure. Our parallel Java implementation generated vectors faster than we expected (Section 5).

The move to ASTs also posed a challenge. Unlike token-based parse trees, setting the minimum size for a vector was not intuitive. While 30 tokens (DECKARD's default) usually map to about three statements, 30 AST nodes could map to either fewer (less than one) or many more. Instead of judging a vector's size on its magnitude, we utilize the additional semantic information from the AST type hierarchy to judge vectors based on the number of contained statement nodes. We modified both the subtree and sliding window phases to use this new measure. (As a usability improvement, we can also set the measure to use lines of code. This can cause issues with multiline statements, though.)

One challenge the original DECKARD faced was the coverage of all interesting combinations of statements. The coverage was affected by three parameters: the minimum vector size, the size of the sliding window, and the sliding window's stride, or how often it outputs vectors. Because the sliding window is now sized on statement nodes, we can permanently set the stride to one and output all interesting vectors: each new vector has at least one new statement.

Instead of operating in a single pass over a fixed vector size, we scan several times, starting at the minimum vector size. Each pass increases the sliding window size by a multiplicative factor, which we have set at 1.5. The sliding window phase terminates when the minimum vector size exceeds the size of the procedure. We apply this exponential sliding window when generating vectors over semantic threads as well.
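Put in code form, the multi-pass scheme just described might look like the following sketch, which merges per-statement vectors with a stride of one and grows the window geometrically between passes. The 1.5 growth factor is the value quoted above; the rest (method names, the flat list-of-vectors input) is our own simplification for illustration.

    import java.util.ArrayList;
    import java.util.List;

    // Merges adjacent per-statement vectors with a stride of one and an exponentially
    // growing window, as applied to procedures and to semantic threads alike.
    final class SlidingWindowMerger {
        static List<int[]> mergedVectors(List<int[]> statementVectors, int minStatements, double growth) {
            if (minStatements < 1 || growth <= 1.0) {
                throw new IllegalArgumentException("need minStatements >= 1 and growth > 1");
            }
            List<int[]> out = new ArrayList<>();
            int total = statementVectors.size();
            for (int window = minStatements; window <= total; window = (int) Math.ceil(window * growth)) {
                for (int start = 0; start + window <= total; start++) {   // stride of one
                    int dim = statementVectors.get(start).length;
                    int[] merged = new int[dim];
                    for (int k = start; k < start + window; k++) {
                        int[] v = statementVectors.get(k);
                        for (int d = 0; d < dim; d++) merged[d] += v[d];
                    }
                    out.add(merged);
                }
            }
            return out;
        }
    }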
4.2 Other Implementation Considerations

Our greatest limitation is that we must have compilable code to retrieve ASTs and PDGs, and only the compiled code is reflected in these structures. At present, we do not have a way to scan code that is deleted by the preprocessor before compilation (other than running multiple builds with different settings). To mitigate this problem for our evaluated projects, we set the configuration to maximize the amount of compiled code whenever possible. For example, our Linux configuration builds every possible kernel option and builds modules for every driver.

The construction of PDGs is not a trivial task, and it presents scaling issues in its own right. CodeSurfer facilitates this process by offering numerous options that control the precision of the PDG build. In order to build PDGs for projects on the million line scale, we were forced to disable precise alias analysis on all builds. This undoubtedly leads to some imprecision in the final graphs, and we could potentially produce multiple semantic threads where only one truly exists. This may cause our tool to miss certain semantic clones, but it does not cause false positives.

With the addition of the semantic vector phase, we have the potential to generate many duplicate vectors. This is not a problem in practice. First, we take advantage of the intraprocedural model and buffer all vectors before printing them, conservatively removing the likely duplicates as they are added. The comparatively small size of a single procedure lets these linear algorithms run quickly.

Second, the LSH back-end is robust against these extra vectors: they merely show up as duplicates in true clone groups or as spurious clone groups. Our post-processing engine quickly removes these.

Third, we exploit the fact that there is a correlation between the number of PDG nodes and the number of AST statement nodes. In the semantic vector phase, we size γ (the overlap constant) to be strictly smaller than our minimum vector size.

5. EMPIRICAL EVALUATION

We evaluated the effectiveness of our tool on five open source projects: The GIMP, GTK+, MySQL, PostgreSQL, and the Linux kernel. The evaluation was performed against DECKARD, the state-of-the-art tool for detecting syntactic clones. In this section, we also present examples of new classes of detectable clones.

5.1 Experimental Setup

We performed our evaluation on a Fedora 6/64bit workstation with a 2.66GHz Core 2 Duo and 4GB of RAM. We used CodeSurfer 2.1p1 and the Sun Java 1.6.0u1 64-bit server VM. To set up the data for analysis, we first maximized the build configuration of each project. We then allowed CodeSurfer to build the PDGs and dump the information to a file. Figure 9 lists the approximate project sizes and build times for our test targets. The size metric is approximate; all whitespace is counted.

                  Size (MLoc)    PDG Build Time    PDG Dump Time
    GIMP                 0.78           25m 57s          20m 40s
    GTK                  0.88           12m 50s          16m 54s
    MySQL                1.13           16m 56s          12m 36s
    PostgreSQL           0.74            9m 12s          21m 48s
    Linux                7.31           296m 1s          241m 4s

Figure 9: Project sizes and AST/PDG build times.
The PDG builds—especially for the Linux kernel—are particularly expensive. When viewed in the context of other PDG based detection approaches that use subgraph isomorphism testing, though, the build times are quite reasonable. In addition, this cost is incurred once per project: repeated runs of our tool reuse the same input data.

5.2 Performance

Through our testing, we observed that requiring a minimum statement node count of 8 produces clones similar in size to DECKARD's minimum token count of 50. Figure 10 shows the execution times for both our semantic clone detection algorithm and our tree-based algorithm.

                      AST Only               AST/PDG
                   VGen    Cluster       VGen    Cluster
    GIMP          0m37s      1m11s      0m44s      1m45s
    GTK           0m31s      0m57s      0m34s      0m53s
    MySQL         0m27s      1m16s      0m29s      1m34s
    PostgreSQL    0m40s      1m50s      0m51s      2m30s
    Linux         8m42s       6m1s      9m48s      7m24s

Figure 10: Clone detection times.

In this table, the VGen phase performs all vector generation. For both the tree-only and the tree/PDG modes, this includes the subtree and sliding window phases. In addition, the AST/PDG mode enumerates both the weakly connected components and the γ = 3 semantic threads (Algorithm 1) and enumerates their respective vectors using the sliding window.

Semantic vector generation adds surprisingly little to the execution overhead. We can attribute this to several factors:

• PDGs are in general significantly (about an order of magnitude) smaller than their equivalent ASTs;
• There are relatively few semantic threads per procedure; and
• Our parallel Java implementation allows the utilization of spare CPU cycles that sat idle during the previously IO-bound tree-only phase.

Coverage-wise, our tool locates more clones than its tree-only predecessor. This is expected: we produce exactly the same set of vectors, then augment it with vectors for semantic clones. In many cases, we observed that the average number of cloned lines of code per clone group differs significantly between the tree-only and semantic versions of the analysis. As we increase the minimum number of statement nodes for a given clone group, the clones reported by the semantic analysis tend to cover more lines of code than those reported by the tree-only analysis.

We believe this is because when the minimum vector size is set to smaller values, the larger semantic clones are detected simultaneously with their smaller, contiguous constituent components. While the semantics-based analysis is able to tie these disparate components together, it does not necessarily increase the coverage. When the minimum is raised, these smaller components are no longer detected as clones.

Figure 11 contains our coverage results for the Linux kernel. Line counts are conservative: we count the precise set of lines covered by each clone group. Whitespace is ignored, and multi-line statements are usually counted as a single line.

After each of our experiments, we sampled thirty clone groups at random and verified their contents as clones. When the minimum number of statement nodes was set to 4, we experienced a false positive rate of 2 in 30. These false positives took the form of small (two to three lines) snippets of code that incidentally mapped to identical characteristic vectors. When the minimum was set to 8 or more, we found no false positives in these random samples.

This low false positive rate is possibly due to the relatively large magnitude of AST-based vectors: the Linux kernel code contained (after macro expansion) an average of 30 AST nodes per line. These larger vectors create a more unique signature for each line of code that is less likely to incidentally match a non-identical line.

5.3 Qualitative Analysis

The quantitative results show that this technique finds more clones with a larger average size. However, this new class of analysis deserves a closer, qualitative look at the results. Semantic clones are more interesting than simple copied and pasted or otherwise structurally identical code. We have observed programming idioms that are pervasive throughout the results.

On a general level, our tool was able to locate semantic clones that were slightly to somewhat larger than their syntactic equivalents, which were also found. The semantic clone often contained the syntactic clone coupled with a limited number of declarations, initializations, or return statements that were otherwise separated from the syntactic clone by unrelated statements. In addition, many semantic clones were subsumed by larger syntactic clones.

We observed cases where our tool was able to locate clone groups that differed only in their use of global locking (Figures 12 and 13). In each case, the tool generated semantic threads for the intrinsic calculation as well as the locking. While the locking pattern itself was too small to be considered a clone candidate, the calculations themselves were matched. In Figure 12, we omitted the third member of the clone group due to space restrictions. This third member also had the locking code in place, and each came from very similar drivers. This lack of locking in one of the three could possibly be indicative of a bug.

     1  static void zc0301_release_resources(struct zc0301_device* cam)
     2  {
     3      DBG(2, "V4L2 device /dev/video%d deregistered", cam->v4ldev->minor);
     4      video_set_drvdata(cam->v4ldev, NULL);
     5      video_unregister_device(cam->v4ldev);
     6      kfree(cam->control_buffer);
     7  }

     1  static void sn9c102_release_resources(struct sn9c102_device* cam)
     2  {
     3      mutex_lock(&sn9c102_sysfs_lock);
     5      DBG(2, "V4L2 device /dev/video%d deregistered", cam->v4ldev->minor);
     6      video_set_drvdata(cam->v4ldev, NULL);
     7      video_unregister_device(cam->v4ldev);
     9      mutex_unlock(&sn9c102_sysfs_lock);
    10      kfree(cam->control_buffer);
    11  }

Figure 12: Two semantic clones differing only by global locking (Linux).

Our tool also found clones that differed only by debugging statements. One example appears in Figure 14. While we found several examples of this behavior, we do suspect that we missed other cases due to the fact that logging code often displays current state information. This places a data dependency on the logging code and causes its inclusion in a larger semantic thread.

We were able to discover specific data access patterns. One example appears in Figure 15. The pattern consists of the semantic thread created by the union of the forward slices of the underlined variables. Note that unrelated (data-wise, but perhaps temporally) statements are interleaved through the pattern-forming code. The frequency and complexity of these "patterns" imply that they are possibly prescriptive and not just coincidental. They could then be used as a specification for bug finding.
    (a) Cloned LOC
    Min. Nds.   AST Only   PDG/AST
     4            935203    940497
     8            350804    354079
    14            150694    152484
    22             65275     66489
    32             30039     30367

    (b) Num. of Clone Groups
    Min. Nds.   AST Only   PDG/AST
     4            160934    170544
     8             49003     54761
    14             16114     18918
    22              5692      7439
    32              2295      3446

    (c) Avg. Cloned LOC / Group
    Min. Nds.   AST Only   PDG/AST
     4              13.9      14.1
     8              15.5      16.2
    14              20.8      22.5
    22              26.5      30.1
    32              31.9      38.9

Figure 11: Coverage results for the Linux Kernel.

     1  void os_event_free(os_event_t event)
     2  {
     3      ut_a(event);
     4      os_fast_mutex_free(&(event->os_mutex));
     5      ut_a(0 == pthread_cond_destroy(&(event->cond_var)));
     6      /* Remove from the list of events */
     7      os_mutex_enter(os_sync_mutex);
     8      UT_LIST_REMOVE(os_event_list, os_event_list, event);
     9      os_event_count--;
    10      os_mutex_exit(os_sync_mutex);
    11      ut_free(event);
    12  }

     1  static void os_event_free_internal(os_event_t event)
     2  {
     3      ut_a(event);
     4      /* This is to avoid freeing the mutex twice */
     5      os_fast_mutex_free(&(event->os_mutex));
     6      ut_a(0 == pthread_cond_destroy(&(event->cond_var)));
     7      /* Remove from the list of events */
     8      UT_LIST_REMOVE(os_event_list, os_event_list, event);
     9      os_event_count--;
    10      ut_free(event);
    11  }

Figure 13: Another example of semantic clones differing only by global locking (MySQL).
6. DISCUSSION AND RELATED WORK

This work presents the first scalable clone analysis that incorporates semantic information. Komondoor and Horwitz [12] use the dependence graph to find semantically identical code fragments. They also successfully use this technique [11] to identify candidates for automatic procedure extraction. Our work also has the potential to be used in this way: our semantic threads are similar to the subgraphs that they discover and extract. Their work relies heavily on expensive graph algorithms and pairwise comparisons and does not scale like ours: they report analysis times of more than an hour [12] (not including the PDG build) for a 10,000 line program. Our algorithm's scalability would allow us to analyze larger projects that may have a greater number of duplicate code fragments. This scalability also makes possible a more detailed and direct comparison with different techniques and tools, similar to the experiments performed by Bellon et al. [4].

We use a scalable, approximate technique for solving the tree similarity problem. Wahler et al. [19] use frequent itemset mining on serialized representations of ASTs to detect clones. Other techniques [13, 17] generate fingerprints of subtrees and report code with similar fingerprints as clones. Compared with our vector-based clone detection, these techniques are less scalable and more coarse grained.

Most potential applications for purely syntactic clone detection are also feasible for semantics-assisted clone detection, and other, new applications exist as well. In the previous section, we identified code patterns that our tool was able to find with the aid of dependency information. Bruntink et al. studied the capabilities of token-based and AST-based clone detection tools for detecting crosscutting concerns [5]. Our PDG-based clone definition may further facilitate such detection, since crosscutting concerns may form large semantic threads. Li and Zhou [15] use frequent itemset mining to identify similar code patterns. Their technique is highly scalable as well, but the mined properties lack temporal information, only association. The patterns we inferred are specific and precise, reflecting direct data flow relationships. However, we found fewer total patterns. We leave for future work the study of our tool's efficacy in mining true specifications and the evaluation of these pattern- and data-based specifications against those found by automaton-learning techniques [1].

Clone detection has also been used to detect design level similarities. Basit and Jarzabek [2] use CCFinder to detect syntactic clone fragments and later correlate them using data mining techniques. Our semantics-based technique could be used in this way as well, and the ability to detect interleaving patterns might increase the scope of the analysis.

Another potentially interesting application of this work is software plagiarism detection. Current, well-used tools include Moss [18] and JPlag (http://www.jplag.de), but these are too coarse grained to find general sets of code clones. Liu et al. [16] have recently developed the GPLAG tool, which applies subgraph isomorphism testing to PDGs to identify plagiarized code. They note that PDGs are resilient to semantics-preserving modifications like (unrelated) statement insertion, statement reordering, and control replacement. Our technique can easily handle interleaved clones, which are characteristic of code with purposefully inserted garbage statements. We expect that our technique can be straightforwardly extended to handle control replacement as well.

We also handle statement reordering: we eliminate ordering information as a result of our transformation of trees to characteristic vectors. Our scalability provides additional opportunities. For example, our tool could be used to perform open source license compliance checks for proprietary software. In this mode, we could generate a large body of vectors representing common or related open source projects and include them in the clustering phase. We leave for future work the evaluation of our tool's applicability to plagiarism detection.

Aside from the scale issues of performing both pairwise comparisons and subgraph isomorphism testing, GPLAG considers only top level procedures as candidates for clones. We are able to consider a much larger set that includes smaller code fragments. Their definition of code similarity as general subgraph isomorphism is also less refined than ours: two isomorphic subgraphs that cross logical flows of data are not likely to be interesting clones. We more carefully enumerate these flows as semantic threads.

7. CONCLUSIONS AND FUTURE WORK

This paper presents the first scalable algorithm for semantic clone detection based on dependence graphs. We have extended the definition of a code clone to include semantically related code and provided an approximate algorithm for locating these clone pairs. We reduced the difficult graph similarity problem to a tree similarity problem by mapping interesting semantic fragments to their related syntax. We then solved this tree-based problem using a highly scalable technique. We have implemented a practical tool based on our algorithm that scales to millions of lines of code. It finds strictly more clones than previous syntax-only techniques, and it is capable of producing interesting sets of semantically similar code fragments.

For future work, we are interested in developing an intraprocedural analysis framework that could aid us in generating PDGs more quickly and for other languages. We also plan to explore applications of this technique.
     1  struct nfs_server *server = NFS_SB(sb);
     4  struct inode *inode;
     5  int error;
     8  /* create a dummy root dentry with dummy inode for this superblock */
     9  if (!sb->s_root) {
    10      struct nfs_fh dummyfh;
    16  nfs_fattr_init(&fattr);
    17  fattr.valid = NFS_ATTR_FATTR;
    18  fattr.type = NFDIR;

     1  struct nfs_server *server = NFS_SB(sb);
     5  struct inode *inode;
     6  int error;
     7  dprintk("--> nfs4_get_root()\n");
     9  /* create a dummy root dentry with dummy inode for this superblock */
    10  if (!sb->s_root) {
    17  nfs_fattr_init(&fattr);
    18  fattr.valid = NFS_ATTR_FATTR;
    19  fattr.type = NFDIR;

Figure 14: Partial semantic clones differing only by a debugging statement (Linux).

     1  rel = heap_open(TypeRelationId, RowExclusiveLock);
     2  tup = SearchSysCacheCopy(TYPEOID, ObjectIdGetDatum(typeOid), 0, 0, 0);
     4  if (!HeapTupleIsValid(tup))
     5      elog(ERROR, "cache lookup failed for type %u", typeOid);
     7  typTup = (Form_pg_type) GETSTRUCT(tup);
     9  typTup->typowner = newOwnerId;
    11  simple_heap_update(rel, &tup->t_self, tup);
    12  CatalogUpdateIndexes(rel, tup);
    14  /* Update owner dependency reference */
    15  changeDependencyOnOwner(TypeRelationId, typeOid, newOwnerId);
    17  heap_close(rel, RowExclusiveLock); /* Clean up */

     1  pg_index = heap_open(IndexRelationId, RowExclusiveLock);
     2  indexTuple = SearchSysCacheCopy(INDEXRELID, ObjectIdGetDatum(indexRelationId), 0, 0, 0);
     4  if (!HeapTupleIsValid(indexTuple))
     5      elog(ERROR, "cache lookup failed for index %u", indexRelationId);
     7  indexForm = (Form_pg_index) GETSTRUCT(indexTuple);
     9  Assert(indexForm->indexrelid == indexRelationId);
    10  Assert(!indexForm->indisvalid);
    11  indexForm->indisvalid = true;
    13  simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
    14  CatalogUpdateIndexes(pg_index, indexTuple);
    16  heap_close(pg_index, RowExclusiveLock);

Figure 15: An example of semantic clones revealing a pattern (PostgreSQL).


8. REFERENCES

[1] G. Ammons, R. Bodík, and J. R. Larus. Mining specifications. In Proceedings of POPL, 2002.
[2] H. A. Basit and S. Jarzabek. Detecting higher-level similarity patterns in programs. In ESEC/FSE, 2005.
[3] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In ICSM, 1998.
[4] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo. Comparison and evaluation of clone detection tools. IEEE Trans. Softw. Eng., 33(9), 2007.
[5] M. Bruntink, A. van Deursen, T. Tourwe, and R. van Engelen. An evaluation of clone detection techniques for identifying crosscutting concerns. In Proceedings of ICSM, 2004.
[6] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst., 9(3), 1987.
[7] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
[8] S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. ACM Trans. Program. Lang. Syst., 12(1), 1990.
[9] L. Jiang, G. Misherghi, Z. Su, and S. Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of ICSE, 2007.
[10] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. TSE, 28(7), 2002.
[11] R. Komondoor and S. Horwitz. Semantics-preserving procedure extraction. In POPL, 2000.
[12] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In SAS, 2001.
[13] K. Kontogiannis, R. de Mori, E. Merlo, M. Galler, and M. Bernstein. Pattern matching for clone and concept detection. Automated Soft. Eng., 3(1/2), 1996.
[14] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A tool for finding copy-paste and related bugs in operating system code. In OSDI, 2004.
[15] Z. Li and Y. Zhou. PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code. In ESEC/FSE, 2005.
[16] C. Liu, C. Chen, J. Han, and P. S. Yu. GPLAG: detection of software plagiarism by program dependence graph analysis. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006.
[17] J. Mayrand, C. Leblanc, and E. Merlo. Experiment on the automatic detection of function clones in a software system using metrics. In ICSM, 1996.
[18] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD, 2003.
[19] V. Wahler, D. Seipel, J. W. von Gudenberg, and G. Fischer. Clone detection in source code by frequent itemset techniques. In SCAM, 2004.
[20] M. Weiser. Program slicing. In ICSE '81: Proceedings of the 5th international conference on Software engineering, 1981.
[21] R. Yang, P. Kalnis, and A. K. H. Tung. Similarity evaluation on tree-structured data. In SIGMOD, 2005.
