LLVM-based Code Clone Detection Framework
Arutyun Avetisyan, Shamil Kurmangaleev, Sevak Sargsyan, Mariam Arutunian, Andrey Belevantsev
Institute for System Programming of the Russian Academy of Sciences
Moscow, Russia
Email: {arut,kursh,sevaksargsyan,arutunian,abel}@ispras.ru
Abstract— Existing methods of code clone detection have some restrictions. Textual and lexical approaches cannot detect strongly modified fragments of code. Syntactic and metrics-based approaches detect strong modifications with low accuracy. In contrast, the semantic approach accurately detects cloned fragments with small changes as well as strongly modified ones, but methods based on this approach do not scale to the analysis of large projects. This paper describes an LLVM-based code clone detection framework that uses program semantic analysis. It has high accuracy and scales to the analysis of millions of lines of source code. The tool embeds a testing system that automatically generates code clones for a project; it is used to determine the accuracy of the developed algorithms. The tool is applicable to all languages that can be compiled to LLVM bitcode. The proposed method was compared with two widely used tools, MOSS and CloneDR. The results show that it has higher accuracy. The tool scales to the analysis of the linux-2.6 kernel, which has about fourteen million lines of source code.

Keywords—code clone, program dependence graph, LLVM.

I. INTRODUCTION
Software developers often reuse the same fragments of code many times, making small modifications. Tight deadlines usually increase copy-paste activity, which increases the number of code clones. Code cloning can lead to many semantic errors. For example, a developer can forget to rename a variable after copy-paste. Software that has many clones will probably have many mistakes and low quality. According to different studies [1, 2], up to 20% of source code can belong to clones. Clone detection tools are widely used:
• during software development, to avoid mistakes and improve quality;
• for automatic refactoring;
• for code size optimization;
• for semantic error detection.

The goal of this paper is to introduce an LLVM-based code clone detection framework. It is based on semantic analysis of the program and scales up to millions of lines of source code. The tool consists of three basic parts.

The first part is responsible for program dependence graph (PDG) generation. PDGs are constructed during the project’s build, which allows creating these graphs without additional source code analysis.

The second part analyzes PDGs for code clone detection. It contains a number of new algorithms for splitting PDGs and detecting similar subgraphs. Due to the use of combined algorithms, the tool scales up to millions of lines of source code. Two types of algorithms are used for maximal isomorphic subgraph detection. Algorithms of the first type try to prove that a pair of PDGs cannot have the desired isomorphic subgraphs. Most PDG pairs are handled by these algorithms, which have linear complexity. Algorithms of the second type are approximate algorithms for maximal isomorphic subgraph detection. They are applied when the algorithms of the first type fail to rule a pair out. They have high computational complexity.

The third part is responsible for testing the developed algorithms. It automatically generates a set of code clones for a project and runs the clone detection algorithms. The number of clones detected by a specific algorithm indicates its correctness.

II. BACKGROUND
A. Clone Types
There are three basic clone types [3]. Type 1 (T1) clones are identical code fragments except for variations in whitespace (and possibly layout) and comments. Type 2 (T2) clones are structurally/syntactically identical fragments except for variations in identifiers, literals, types, layout and comments. Type 3 (T3) clones are copied fragments with further modifications: statements can be changed, added or removed, in addition to variations in identifiers, literals, types, layout and comments (Fig. 1).

Original source:
    void sumProd(int n) {
        float sum = 0.0;
        float prod = 1.0;
        for (int i = 1; i<=n; i++) {
            sum = sum + i;
            prod = prod * i;
            foo(sum, prod);
    } }

Clone Type 1:
    void sumProd(int n) {
        float sum = 0.0; //C1
        float prod = 1.0; // C2
        for (int i = 1; i <= n; i++) {
            sum = sum + i;
            prod = prod * i;
            foo(sum, prod);
    }}

Clone Type 2:
    void sumProd(int n) {
        int s = 0; //C1
        int p = 1; // C2
        for (int i = 1; i <= n; i++) {
            s = s + i;
            p = p * i;
            foo(s, p);
    }}

Clone Type 3:
    void sumProd(int n) {
        int s = 0; //C1
        int p = 1; // C2
        for (int i = 1; i <= n; i++) {
            s = s + i * i;
            // deleted
            foo(s, p);
    }}
Fig. 1. Examples for three clone types.
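The distinction between the clone types in Fig. 1 can be made concrete with a small lexical normalization sketch. This is only an illustration of the definitions, not the PDG-based method of this paper: the tokenizer, the `TYPES` and `KEYWORDS` sets and the function names are assumptions introduced here. T1 equality is checked after dropping comments and whitespace; T2 equality is checked after additionally replacing identifiers, literals and type names with placeholders.

```python
import re

# Assumed, deliberately tiny vocabularies for the Fig. 1 fragments.
TYPES = {"void", "int", "float", "double", "char", "long", "short"}
KEYWORDS = {"for", "while", "if", "else", "return"}

# Numbers, identifiers, a few two-character operators, then any other symbol.
TOKEN_RE = re.compile(r"\d+\.\d+|\d+|[A-Za-z_]\w*|<=|>=|==|!=|\+\+|--|[^\s\w]")


def t1_tokens(code):
    """Token stream with comments and whitespace removed (T1 view)."""
    no_comments = re.sub(r"/\*.*?\*/|//[^\n]*", " ", code, flags=re.S)
    return TOKEN_RE.findall(no_comments)


def t2_tokens(code):
    """T1 view with identifiers, literals and types abstracted (T2 view)."""
    out = []
    for tok in t1_tokens(code):
        if tok in TYPES:
            out.append("TYPE")
        elif tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isdigit():
            out.append("LIT")
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        else:
            out.append(tok)
    return out


ORIGINAL = """void sumProd(int n) {
float sum = 0.0;
float prod = 1.0;
for (int i = 1; i<=n; i++) {
sum = sum + i; prod = prod * i; foo(sum, prod);
} }"""

TYPE1 = """void sumProd(int n) {
float sum = 0.0; //C1
float prod = 1.0; // C2
for (int i = 1; i <= n; i++) {
sum = sum + i; prod = prod * i; foo(sum, prod);
}}"""

TYPE2 = """void sumProd(int n) {
int s = 0; //C1
int p = 1; // C2
for (int i = 1; i <= n; i++) {
s = s + i; p = p * i; foo(s, p);
}}"""

TYPE3 = """void sumProd(int n) {
int s = 0; //C1
int p = 1; // C2
for (int i = 1; i <= n; i++) {
s = s + i * i;
// deleted
foo(s, p);
}}"""

print("T1 match:", t1_tokens(ORIGINAL) == t1_tokens(TYPE1))
print("T2 match:", t2_tokens(ORIGINAL) == t2_tokens(TYPE2))
print("T3 matches under T2 view:", t2_tokens(ORIGINAL) == t2_tokens(TYPE3))
```

The T3 fragment fails both comparisons because a statement was changed and another removed, which is why purely lexical normalization is not enough for that clone type.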
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on January 05,2025 at 19:17:53 UTC from IEEE Xplore. Restrictions apply.
B. Code clone detection approaches
There are five basic approaches to code clone detection [4, 5].

• Methods based on the textual approach consider the source code as text and try to find equal substrings [6]. These substrings are clones. When all clones are found, clones located near each other can be combined into one. Mostly T1 clones are found.
• In the lexical approach, source code is parsed into a sequence of tokens. Then the longest common subsequence is determined. There are a few effective algorithms based on the parameterized suffix tree for clone detection [7]. Another interesting method transforms Java code to an intermediate representation and compares the results instead of the original source [8]. These algorithms find mostly T1 and T2 clones.
• The next is the syntactic approach. The algorithm works on the abstract syntax tree (AST). In this case clones are matching AST subtrees. Some algorithms directly compare two ASTs to find common subtrees [9]. Another algorithm constructs vectors of AST subtrees and compares them [10]. Algorithms based on this approach find all three types of clones.
• Metrics-based algorithms are widely used for clone detection. They compute a number of metrics for code fragments and compare them. These metrics are usually computed for ASTs and PDGs [11]. Another method clusters the computed metrics using neural networks [12]. Metrics-based algorithms have better performance than AST or PDG comparison algorithms, but have low accuracy.
• The last one is the semantic approach. The source code is parsed into a PDG. PDG nodes are program instructions, whereas PDG edges are dependences between those instructions. PDG-based algorithms try to find maximal isomorphic subgraphs for a pair of PDGs [13, 14, 15]. All these algorithms are approximate because maximal isomorphic subgraph detection is an NP-hard problem. PDG-based methods have high accuracy but low performance.

Textual and lexical approaches are not effective for detecting T3 clones. AST- and metrics-based methods detect T3 clones with low accuracy. Only semantic analysis allows reaching high accuracy.

III. PDG GENERATION
PDGs for the project are generated from the LLVM intermediate representation, called bitcode. An LLVM pass is added to generate these graphs (Fig. 2). The generation happens at project compile time, which allows constructing graphs for large-scale projects efficiently. PDG vertices are LLVM bitcode [16] instructions. Edges are obtained from LLVM use-def [16], alias and control flow analyses. Vertices that have no edges are removed, after which the optimized PDGs are stored to files. The tool can generate PDGs at three levels of detail. Edges of the minimal PDG are constructed based only on LLVM use-def analysis. The middle-level PDG also includes edges obtained by alias analysis. The full PDG contains all data and control dependencies. This approach avoids spending unneeded resources; e.g., the minimal PDG is enough for accurate detection of T1 and T2 clones. LLVM provides compiler APIs and has a large set of optimization libraries. Because of this, many programming languages provide source code translation to LLVM bitcode. Therefore the developed tool can be applied to all these languages.

[Figure: Clang followed by a sequence of LLVM passes producing the executable; the added PDG pass performs PDG construction, PDG optimizations and PDG serialization.]
Fig. 2. LLVM based model for PDGs’ generation.

IV. CLONE DETECTION
Clone detection is a multistage process. First, the generated PDGs are loaded into memory, and then four basic steps are performed (Fig. 3). The first step splits the PDGs into subgraphs. These subgraphs are considered potential clones of each other. The second step applies fast check algorithms. These algorithms have linear complexity and try to prove that a pair of PDGs cannot have large enough isomorphic subgraphs. The third step is maximal isomorphic subgraph detection. A new slice-based algorithm (Section IV.C) is proposed for it. The fourth step filters the obtained pairs of maximal isomorphic subgraphs. Finally, the source code corresponding to the isomorphic subgraphs is printed as detected clones.

[Figure: stored PDGs are passed to the code clone detector, whose stages are: load PDG, split PDG to subgraphs, fast checks, code clone detection algorithm, filtering, printing.]
Fig. 3. Basic stages of code clones detection.

A. Splitting PDGs
Three methods are implemented for splitting. The first method splits the PDG into weakly connected components. The
second method splits the PDG into subgraphs such that every pair has fewer than N common nodes [17]. These two methods have two basic disadvantages: subgraph sizes may vary widely, and the source code lines corresponding to one subgraph may be located far from each other. To avoid these disadvantages, a third method is proposed. PDG edges are considered as source code ranges (Fig. 4).

Source code:
    1: a = 10;
    2: b = a*5;
    3: x = 2;
    4: x = x*x

Vertices   Intersected ranges
1          0
2          1
3          0
4          1

Graph G (vertices 1 to 4) is split into G1 (vertices 1 and 2) and G2 (vertices 3 and 4).
Fig. 4. Example of PDG’s splitting.

For the source code lines corresponding to a PDG, the number of intersecting ranges is computed for each line. The source code is split at those lines that have the minimum number of intersecting ranges. The subgraphs corresponding to the split code fragments are considered for clone detection. Experimental results show that this splitting method detects about 1.5-2 times more clones than the first and second methods.

B. Fast checks
These algorithms have linear complexity and try to prove that a pair of PDGs does not have large enough isomorphic subgraphs. Two PDG nodes are similar if their types are the same. Fast check algorithms compare PDG nodes based on their types. If an algorithm is not able to detect enough pairs of similar nodes in the corresponding graphs, these graphs cannot have large enough isomorphic subgraphs.

The first algorithm stores PDG nodes in a hash set keyed by node type. If the intersection of the sets for a pair of PDGs is not large enough, then this pair does not have the desired isomorphic subgraphs.

The second algorithm computes a characteristic vector for every PDG. Each element of this vector is the count of nodes of a specific type. If the Euclidean distance between the vectors of a pair of PDGs is too large, then this pair fails the fast check.

C. Slice-based clone detection
For a given pair of PDGs, candidate pairs of nodes are constructed. The first node in a pair is from the first PDG, the second one from the second PDG. For every pair of nodes, backward and forward slices [13] are applied to construct isomorphic subgraphs. Maximal isomorphic subgraphs are selected from the constructed set of isomorphic subgraph pairs.

Two approaches are developed for candidate set construction. The first approach chooses, for every node of the first PDG, the most similar node from the second PDG. Metrics [18] are used for similar node detection. For all nodes of both PDGs, bit vectors [18] are constructed. The most similar nodes are chosen based on a similarity function [18]. The second approach considers vertices of the first PDG with a maximal number of neighbors and tries to find identical vertices (with neighbors) in the second PDG.

D. Filtration
The last stage of code clone detection is the filtration of detected pairs of isomorphic subgraphs. The need for a filter arises from the fact that the concept of a code clone is defined for the source code of the program, but isomorphic subgraphs are reported as clones. A code clone must represent a sequence of lines in the file (not necessarily consecutive, but not highly dispersed). The purpose of filtering is to verify that the source code corresponding to the isomorphic subgraphs is not too scattered.

V. AUTOMATIC CLONE GENERATION
Two approaches are suggested for automatic generation of code clones. The first method uses obfuscation [19] and standard LLVM transformation and optimization passes. For every function of the LLVM bitcode, two PDGs are constructed. The first PDG comes from the original code and is constructed from the LLVM bitcode generated by the Clang compiler. The second PDG is the clone PDG, and it is constructed from the transformed/obfuscated bitcode. Standard LLVM passes are applied to the bitcode for transformation (Fig. 5).

[Figure: elements shown are Clang, LLVM bitcode, LLVM passes, the modified PDG, the original PDG and their comparison.]
Fig. 5. LLVM-based clone generation model.

The second method merges PDGs of the original program to generate code clones (Fig. 6). Three merge methods are applied. The first performs the union of two PDGs without adding extra edges or vertices. The second unions a pair of PDGs and also adds extra random edges between the nodes of the corresponding graphs. The third considers nodes of the first PDG and tries to find similar nodes in the second PDG. If a similar node is detected, then all neighbors of this node are added to the first PDG with their corresponding edges.

To check the correctness of the implemented clone detection algorithms, original and cloned PDGs are compared. The number of detected clones indicates the correctness and power of the tested algorithm.
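The three merge methods and the comparison step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the `PDG` class, the function names, the direction of the random edges and the way copied neighbors are re-attached are all hypothetical choices made here. `may_contain` is only a stand-in for a real detector; it checks a necessary condition on node types, in the spirit of the fast checks of Section IV.B.

```python
import random
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PDG:
    """Toy PDG: node id -> instruction type, plus directed dependence edges."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)


def union(a, b):
    """Merge method 1: plain union, no extra edges or vertices."""
    merged = PDG()
    for tag, g in (("a", a), ("b", b)):
        for n, t in g.nodes.items():
            merged.nodes[(tag, n)] = t  # tag ids so the two graphs cannot collide
        for u, v in g.edges:
            merged.edges.add(((tag, u), (tag, v)))
    return merged


def union_with_random_edges(a, b, extra, rng):
    """Merge method 2: union plus up to `extra` random edges (assumed a -> b)."""
    merged = union(a, b)
    left = [("a", n) for n in a.nodes]
    right = [("b", n) for n in b.nodes]
    for _ in range(extra):
        merged.edges.add((rng.choice(left), rng.choice(right)))
    return merged


def absorb_similar_neighbors(a, b):
    """Merge method 3: for each node of `a` with a same-type node in `b`,
    copy that node's neighbors (and the connecting edges) into `a`."""
    merged = PDG(dict(a.nodes), set(a.edges))
    for n, t in a.nodes.items():
        match = next((m for m, mt in b.nodes.items() if mt == t), None)
        if match is None:
            continue
        for u, v in b.edges:
            if u == match and v != match:      # outgoing neighbor of the match
                merged.nodes[("b", v)] = b.nodes[v]
                merged.edges.add((n, ("b", v)))
            elif v == match and u != match:    # incoming neighbor of the match
                merged.nodes[("b", u)] = b.nodes[u]
                merged.edges.add((("b", u), n))
    return merged


def may_contain(original, merged):
    """Stand-in detector: every node type of `original` occurs in `merged`
    at least as often. A real detector would search for isomorphic subgraphs."""
    need = Counter(original.nodes.values())
    have = Counter(merged.nodes.values())
    return all(have[t] >= c for t, c in need.items())


g1 = PDG({1: "load", 2: "mul", 3: "store"}, {(1, 2), (2, 3)})
g2 = PDG({1: "load", 2: "add", 3: "call"}, {(1, 2), (1, 3)})

clones = [
    union(g1, g2),
    union_with_random_edges(g1, g2, extra=2, rng=random.Random(0)),
    absorb_similar_neighbors(g1, g2),
]
detected = sum(may_contain(g1, c) for c in clones)
print(f"{detected} of {len(clones)} generated clones pass the stand-in check")
```

Because every generated graph is known to contain the original PDG, the fraction of generated clones that a detector reports gives a direct measure of how many clones it misses.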
[Figure: the original list of PDGs (PDG 1 ... PDG N); the merged list of PDGs (PDG’ 1 ... PDG’ N/2), with each PDG’ j built from a pair PDG i, PDG k; comparison of PDG i with PDG’ j.]
Fig. 6. Clone generation based on PDG’s merging.

VI. RESULTS
The developed tool was applied to a number of widely used libraries and software systems. It was compared with other clone detection tools. The tests were run on a machine with an Intel Core i3 CPU 540 and 8GB RAM.

The first test (Original Code) was modified in different ways to obtain all three types of clones. The paper [22] contains more details for all tests. Theoretically all files are clones, because they were obtained by modification of a single test. A clone detection tool with high accuracy should detect as many clones as possible. Tab. 1 shows results of comparison for MOSS, CloneDR and the three developed methods for clone detection.

B. PDG generation time
Fig. 7 shows lines of source code for the analyzed projects. The Linux kernel has about fourteen million lines of source code written in the C language. Fig. 8 shows compilation time of the project with and without PDG generation. In the worst case compile time increases only by ~30%.

[Figure: bar chart of source code lines (million lines, axis from 0 to 16) for Linux 2.6, Mozilla, LLVM and OpenSSL.]
Fig. 7. Lines of source code for projects.

[Figure: chart for Linux 2.6, Mozilla, LLVM and OpenSSL.]
Fig. 9. False positive rate of clone detection.

ACKNOWLEDGMENT
The paper is supported by RFBR grant 15-07-07541.
REFERENCES
[1] B. Baker, “On finding duplication and near-duplication in large software systems”, in: Proceedings of the 2nd Working Conference on Reverse Engineering, 1995, pp. 86-95.
[2] C. K. Roy and J. R. Cordy, “An empirical study of function clones in open source software systems”, in: Proceedings of the 15th Working Conference on Reverse Engineering, 2008, pp. 81-90.
[3] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo, “Comparison and evaluation of clone detection tools”, IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 577-591, 2007.
[4] D. Rattan, R. Bhatia, and M. Singh, “Software clone detection”, Information and Software Technology, vol. 55, no. 7, pp. 1165-1199, 2013.
[5] S. Gupta and P. C. Gupta, “Literature survey of clone detection techniques”, International Journal of Computer Applications, vol. 99, no. 3, pp. 41-44, 2014.
[6] S. Ducasse, M. Rieger, and S. Demeyer, “A language independent approach for detecting duplicated code”, in: Proceedings of the 15th International Conference on Software Maintenance, 1999, pp. 109-119.
[7] T. Kamiya, S. Kusumoto, K. Inoue, “CCFinder: a multilinguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654-670, 2002.
[8] R. Kaur and S. Singh, “Clone detection in software source code using operational similarity of statements”, ACM SIGSOFT Software Engineering Notes, vol. 39, no. 3, pp. 1-5, 2014.
[9] I. Baxter, A. Yahin, L. Moura, M. Anna, “Clone detection using abstract syntax trees”, in: Proceedings of the 14th IEEE International Conference on Software Maintenance, 1998, pp. 368-377.
[10] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, “DECKARD: Scalable and accurate tree-based detection of code clones”, in: Proceedings of the 29th International Conference on Software Engineering, 2007, pp. 96-105.
[11] J. Mayrand, C. Leblanc, and E. Merlo, “Experiment on the automatic detection of function clones in a software system using metrics”, in: Proceedings of the 12th International Conference on Software Maintenance, 1996, pp. 244-253.
[12] N. Davey, P. Barson, S. Field, and R. Frank, “The development of a software clone detector”, International Journal of Applied Software Technology, vol. 1, no. 3/4, pp. 219-236, 1995.
[13] R. Komondoor and S. Horwitz, “Using slicing to identify duplication in source code”, in: Proceedings of the 8th International Symposium on Static Analysis, 2001, pp. 40-56.
[14] J. Krinke, “Identifying similar code with program dependence graphs”, in: Proceedings of the 8th Working Conference on Reverse Engineering, 2001, pp. 301-309.
[15] S. Sargsyan, S. Kurmangaleev, A. Belevantsev, H. Aslanyan, A. Baloian, “Scalable code clone detection tool based on semantic analysis”, The Proceedings of ISP RAS, vol. 27, no. 1, pp. 39-49, 2015.
[16] The LLVM Compiler Infrastructure. [Online]. Available: www.llvm.org
[17] M. Gabel, L. Jiang, Z. Su, “Scalable detection of semantic clones”, in: Proceedings of the 30th International Conference on Software Engineering, 2008, pp. 321-330.
[18] S. S. Sargsyan, S. F. Kurmangaleev, A. V. Baloian, and H. K. Aslanyan, “Scalable and accurate clones detection based on metrics for dependence graph”, Mathematical Problems of Computer Science, vol. 42, pp. 54-62, 2014.
[19] S. F. Kurmangaleev, V. P. Korchagin, V. V. Savchenko, and S. S. Sargsyan, “Building an obfuscation compiler based on LLVM infrastructure”, The Proceedings of ISP RAS, vol. 23, pp. 77-92, 2012.
[20] MOSS: A System for Detecting Software Plagiarism. [Online]. Available: http://theory.stanford.edu/~aiken/moss/
[21] Clone Doctor: Software Clone Detection and Reporting. [Online]. Available: www.semdesigns.com/products/clone/
[22] C. K. Roy, J. R. Cordy, and R. Koschke, “Comparison and evaluation of code clone detection techniques and tools: A qualitative approach”, Science of Computer Programming, vol. 74, no. 7, pp. 470-495, 2009.