A Visual Analysis Clone Method
to detect semantically similar code. Embedding-based techniques, including neural embeddings or transformer-based representations, leverage large datasets of code to learn generalizable features that are well suited to detecting near-miss or semantic clones (Yahya and Kim, 2023; Zhang and Saber, 2025a; Pinku et al., 2024).

Refactoring clone pairs or groups helps control maintenance complexity by unifying duplicate logic. In particular, refactoring operations such as the Extract Method can convert multiple clones into a single shared function, mitigating the risk of missing bug fixes or enhancements in one cloned location (AlOmar et al., 2024; Lei et al., 2022). However, clone refactoring can be challenging in large-scale systems, especially when clones occur across different modules, repositories, or even programming languages (Kanwal et al., 2022; Li et al., 2023). Cross-language clone detection and management present additional hurdles, as equivalent functionality may appear in distinct languages, each with its own APIs, idioms, and libraries (Yahya and Kim, 2023; Khajezade et al., 2024). Despite these challenges, recent work has begun to merge deep learning with cross-language embeddings to detect functional similarity at a higher level of abstraction (Alam et al., 2023; Zhang and Saber, 2025a).

The rise of transformer-based large language models (LLMs) that have been trained on extensive code corpora, such as GPT or CodeBERT, has opened new avenues for discovering non-trivial clones and recommending potential refactorings (Zhang and Saber, 2025b; Feng et al., 2024). Researchers are currently investigating how best to harness these models, for example, by fine-tuning them on labeled clone datasets to capture both syntactic and semantic relationships among code fragments (Xu et al., 2024; Pinku et al., 2024). Although LLMs show promise in capturing deeper semantic structures, they can also generate spurious matches or struggle with domain-specific libraries, highlighting the need for further investigation into model architectures and tokenization strategies (Khajezade et al., 2024; Assi et al., 2025).

Beyond detection, clone management is another essential area of study. Effective management involves ranking and tracking clones throughout the software lifecycle, visualizing their relationships, and prioritizing them for refactoring (Zakeri-Nasrabadi et al., 2023; Kaur and Rattan, 2023). Some tools now incorporate techniques like complexity metrics or usage frequency to automatically suggest which clones should be refactored first (Kanwal et al., 2022; Sundelin et al., 2025). In practice, decisions about whether to refactor or keep duplicated code can also depend on performance considerations, the developer’s workflow, or domain-specific constraints.

Visual analysis is one emerging trend that can aid developers in spotting and managing clones at scale. By rendering clone groups or classes in an interactive graph or heatmap, developers can navigate clusters of related code and quickly pinpoint large or complex duplications (Zhang and Saber, 2025a; Zakeri-Nasrabadi et al., 2023). Furthermore, integrating clone detection and refactoring support directly into modern IDEs can facilitate a continuous clone management process, alerting developers whenever they introduce a new clone or significantly alter an existing one (AlOmar et al., 2024; Hu et al., 2023). This model of “just-in-time” clone awareness supports more proactive handling of code duplication, instead of relying on post-hoc analysis after technical debt has accrued.

In this study, we propose a visual clone management approach that builds upon these lines of research. By combining a hybrid detection technique (one that integrates textual, lexical, and tree-based features) with an interactive visualization layer, we aim to provide both robust detection of near-duplicate code and developer-friendly tools for prioritizing and refactoring high-impact clones. To validate our approach, we compare detection results against established benchmarks and demonstrate how visual insights in the IDE can streamline clone analysis and management. Our ultimate goal is to enhance maintainability by leveraging state-of-the-art code similarity insights while placing the developer at the center of the clone management loop (Wang et al., 2023; Pinku et al., 2024).

Finally, we emphasize the growing need for cross-project or cross-repository clone analysis, especially as organizations adopt microservices architectures or polyglot development practices. In that context, large-scale empirical evaluations with real-world codebases are crucial for demonstrating the feasibility and cost-effectiveness of advanced clone detection tools (Xu et al., 2024; Lei et al., 2022). By presenting our tool’s design and empirical outcomes, we hope to further the conversation on how best to integrate clone detection and refactoring into everyday development workflows, mitigating risks associated with duplicated logic while helping developers maintain clear, consistent, and evolvable software (Alam et al., 2023; Khajezade et al., 2024).

The repository for the source code is available on GitHub under the account jnavas-tec, and the project is vizclone.

In the rest of this article, we first present the proposed method in Section 2, followed by the validation of our proposed Clone Detection Algorithm in Section 3. We present the results in Section 4, and Section 5 contains the discussion of the results.

2 Materials and methods

In this section, we present our proposed method for clone management, which consists of two main stages: a screening and detection stage described in Section 2.1, and a visualization and analysis stage described in Section 2.2. The diagram in Figure 1 depicts the complete steps of the proposed method.

2.1 Clones screening and detection

Our method detects code clones at the method granularity level, mainly because code clones are disseminated primarily across methods. However, even those clone classes with many cloned fragments across different methods represent a tiny fraction of a software system’s total number of methods. Thus, comparing every method in the system against every other method is a time-wasting task. To avoid this, a fast and scalable algorithm for finding similar items performs a screening of candidate pairs of methods with a minimum expected probability of being code clones. This algorithm implements the technique of Near-Neighbor Search that combines the techniques of K-Shingles,
Minhash Signatures, and Locality-Sensitive Hashing, as described in Leskovec et al. (2020), and is a textual clone detection approach. We succinctly explain our implementation of their algorithm in Section 2.1.1. A Syntactic-Hierarchical Local-Global Sequence Alignment Algorithm verifies the set of candidate pairs of code clones returned by the screening algorithm. This verification algorithm combines a syntactic-hierarchical analysis of methods and their sentences with local and global sequence alignment, as shown in Section 2.1.2. The local sequence alignment-based sub-algorithm implements a lexical approach, while the global sequence alignment-based sub-algorithm implements a tree-based approach.

We use the IDE internals to syntactically collect and analyze all the methods in the project. The developer can choose whether to compare the actual instances of literals and identifiers (e.g., cloneList, 80.0, “a string”) or to compare them by their token names (e.g., IDENTIFIER, FLOAT_LITERAL, STRING_LITERAL). The methods are then represented hierarchically as a sequence of sentences, with each sentence being a sequence of tokens.

2.1.1 Near-neighbor search screening algorithm

The problem of detecting code clones can be considered a subcase of finding similar documents from a large set of documents. In the context of document similarity, we employ the Jaccard similarity measure to determine how similar two documents are. The Jaccard similarity between two sets A and B is |A ∩ B|/|A ∪ B|.

Comments and extra white spaces are trimmed from the source code. The signature of the method is also ignored. The algorithm ignores the methods with fewer sentences than a minimum threshold of sentences or fewer tokens than a minimum threshold of tokens. The source code from the body of the methods is represented as a set of k-shingles to identify similar sequences in different methods. A shingle is a substring of length k extracted from the code. For example, the text “Sample_text” contains the following set of 5-shingles: {“Sampl”, “ample”, “mple_”, “ple_t”, “le_te”, “e_tex”, “_text”}. We selected a value of k equal to 27 to keep a low probability that a shingle would appear in any given method. The sentences and tokens of the methods are joined with whitespace and fed to the shingle extractor. All the shingles from all the methods are extracted and added to a list, and each shingle in each method is then replaced by its index in that list. The sets of shingle indices represent the methods, and we could obtain the similarity level between any two methods by calculating the Jaccard similarity of their two shingle index sets. However, doing so would leave us back at square one, comparing every pair of methods. Instead, the algorithm replaces all the shingle index sets from the methods with their Minhash signatures.

The algorithm calculates the Minhash signatures for all the methods by choosing the signature size n as the product of the number of bands b in the signature and the number of rows r per band. The number of bands and rows allows us to establish the minimum similarity level needed to screen any pair of methods as clone candidates. For each band, it generates a permutation of all the shingles’ indices and selects each method’s permuted first r shingles’ indices as its signature part for the band.

The algorithm feeds the methods’ Minhash signatures to a Locality-Sensitive Hashing (LSH) step. The LSH step takes b iterations over all the methods’ Minhash signatures. The i-th iteration hashes the i-th band of each method’s Minhash signature and collects the methods that are hashed to the same bucket as clone candidate pairs. The sets of buckets in each iteration are independent of each other.

Our implementation of LSH can detect clone candidate pairs with a similarity of at least s by choosing b and r such that s = (1/b)^(1/r). For example, if we set the minimum similarity to s = 0.8 and the number of rows per band to r = 18, then the number of bands is b = 56, and the length of the Minhash signatures is n = b × r = 1008. On the one hand, to decrease the number of false negatives, we can adjust b and r to achieve a value of s lower than 0.8. On the other hand, to decrease the number of false positives, we can set b and r to produce s greater than 0.8.

2.1.2 Syntactic-hierarchical local-global sequence alignment algorithm

Once we have the list of clone candidate pairs, the algorithm must verify which are indeed code clones. Our clone verification algorithm takes every pair of candidates and performs a syntactic-hierarchical local-global sequence alignment step. Syntactic means leveraging the Abstract Syntax Tree (AST) constructed by the IDE’s language compiler to extract each method and its tokenized sentences. Each method then consists of a syntactic hierarchy of lists: a list of sentences identified by their type (e.g., if-statement, while-statement, assignment-statement) and their sublists of tokens. Moreover, local-global alignment means applying the local sequence alignment algorithm from Smith and Waterman (1981) with the optimization by Gotoh (1982) at the sentence level while applying the global sequence alignment algorithm from Needleman and Wunsch (1970) at the token level.

The best local alignment found between the sentences of two clone candidate methods is flagged as a clone if the similarity of the alignment is at least the selected similarity threshold (e.g., 0.8) and if the alignment has at least a selected minimum number of sentences and tokens. The local alignment algorithm compares each sentence of one method against every sentence of the other method. It considers two sentences similar if they are of the same sentence type and if the similarity returned by the global sequence alignment of their tokens is at least the selected similarity threshold. Otherwise, the algorithm treats them as a mismatch or introduces a gap, whichever yields the most similar alignment.

If the local alignment for a verified clone pair contains any gaps or mismatches, the algorithm recognizes it as a Type 3 clone. However, if it does not contain any gaps or mismatches and the source code of the sentences is identical, then it identifies it as a Type 1 clone. Otherwise, the algorithm detects it as a Type 2 clone.

Finally, the algorithm merges the verified clone pairs into clone classes or groups:
1. It collects all the fragments of the clone pairs in a list and sorts them by their owner method.
2. It identifies the fragments of the same method that overlap by at least a selected overlap ratio (e.g., 0.6) and merges their corresponding clone pairs into a clone class.
3. If a clone pair exists whose fragments match the other two fragments of the recently merged clone pairs, the algorithm adds this clone pair to the clone class from the previous step.

FIGURE 1 High-level illustration of the proposed method.

FIGURE 2 IDE view with all visualization components.

2.2 Clones visualization and analysis

After screening and verifying the clone classes along with their clone fragments, the plugin shows an interactive visual representation of the graph assembled by the clones and the fragments in the IDE. As shown in Figure 2, the IDE includes the following:
1. A toolbar that features an interactive graphical visualization of all the clone classes.
2. A toolbar with the list of fragments from the clone class selected on the previous graph.
3. An editor window that shows the method containing the fragment selected from the fragments list.
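As an illustration of the screening stage of Section 2.1.1, the following minimal sketch strings together k-shingling, Jaccard similarity, Minhash signatures, and LSH banding. It is not the plugin's implementation: it uses the common variant of Minhash based on random linear hash functions rather than the per-band permutations described in the text, and all function names, the prime modulus, and the seed are assumptions made for this example.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # assumed modulus (2^31 - 1), larger than any shingle index used here


def shingles(text: str, k: int) -> set[str]:
    """Return the set of all substrings of length k (the k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def lsh_threshold(bands: int, rows: int) -> float:
    """Approximate similarity s = (1/b)^(1/r) around which pairs become candidates."""
    return (1.0 / bands) ** (1.0 / rows)


def minhash_signature(indices: set[int], coeffs: list[tuple[int, int]]) -> list[int]:
    """One minimum per random hash function h(x) = (a*x + c) mod PRIME."""
    return [min((a * x + c) % PRIME for x in indices) for a, c in coeffs]


def lsh_candidates(signatures: dict[str, list[int]], bands: int, rows: int) -> set[tuple[str, str]]:
    """Pair up methods whose signatures agree on every row of at least one band."""
    candidates: set[tuple[str, str]] = set()
    for band in range(bands):
        buckets = defaultdict(list)  # a fresh, independent set of buckets per band
        for name, signature in signatures.items():
            buckets[tuple(signature[band * rows:(band + 1) * rows])].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates


def screen(methods: dict[str, str], k: int, bands: int, rows: int, seed: int = 0) -> set[tuple[str, str]]:
    """Screen method bodies (name -> normalized text) for clone candidate pairs."""
    shingle_sets = {name: shingles(body, k) for name, body in methods.items()}
    vocabulary = sorted(set().union(*shingle_sets.values()))
    index_of = {shingle: i for i, shingle in enumerate(vocabulary)}
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(bands * rows)]
    signatures = {
        name: minhash_signature({index_of[s] for s in sset}, coeffs)
        for name, sset in shingle_sets.items()
        if sset  # bodies shorter than k characters produce no shingles and are skipped
    }
    return lsh_candidates(signatures, bands, rows)
```

With the values from the text, `lsh_threshold(56, 18)` evaluates to roughly 0.80, which matches the example of s = 0.8 with r = 18 and b = 56. Only the pairs returned by `screen` would then go on to the more expensive alignment-based verification.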
FIGURE 3
Clone classes visualization.
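The verification rules of Section 2.1.2 can be sketched in a similar way. The example below is a simplified illustration under stated assumptions, not the authors' algorithm: `token_similarity` is a global alignment in the style of Needleman and Wunsch with an assumed scoring scheme (match 1, mismatch 0, gap 0) normalized by the longer sequence, and `AlignedSentences` and `clone_type` are hypothetical names that encode only the Type 1, Type 2, and Type 3 decision described in the text. The sentence-level local alignment itself is taken as given.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def token_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Global alignment score of two token lists, normalized to [0, 1].

    Assumed scores: match 1, mismatch 0, gap 0. With these scores the optimal
    score equals the number of tokens matched in order.
    """
    if not a and not b:
        return 1.0
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (1 if x == y else 0)
            current.append(max(diagonal, previous[j], current[j - 1]))
        previous = current
    return previous[-1] / max(len(a), len(b))


@dataclass(frozen=True)
class AlignedSentences:
    """One column of a sentence-level alignment; None marks a gap."""
    left: Optional[str]   # source text of the sentence in the first method
    right: Optional[str]  # source text of the sentence in the second method
    similar: bool         # same sentence type and token similarity above the threshold


def clone_type(alignment: Sequence[AlignedSentences]) -> str:
    """Classify a verified clone pair from its sentence alignment."""
    if any(c.left is None or c.right is None or not c.similar for c in alignment):
        return "Type 3"  # the alignment contains gaps or mismatches
    if all(c.left == c.right for c in alignment):
        return "Type 1"  # no gaps or mismatches and identical source text
    return "Type 2"      # no gaps or mismatches, but the source text differs
```

For instance, two assignments that differ only in an identifier name would align as similar sentences with different source text, so the pair would be reported as Type 2.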
One region of Figure 2 is the interactive visualization of all the clone classes; it shows all the clone groups and their code fragments. The visualization is shown in more detail in Figure 3, with its four subregions tagged in red:
1. A bar chart with all the clone classes found. It has a slider to zoom in on a subset of all the clone classes.
2. A zoomed-in subset of clone classes that fisheyes a hovered-over clone class and displays information about it (i.e., clone identifier, similarity percentage, clone type, cognitive complexity, and number of fragments).
3. A region with the arcs that connect the clone classes to their fragments.
4. A bar chart that displays all the fragments from the clone classes. It allows users to hover over the fragments of the selected clone and shows their information (i.e., class signature, method name, similarity percentage, clone type, and cognitive complexity).

The clones visualized in Figure 3 belong to the code base of the JetBrains Open Source IntelliJ Community project on GitHub. The bar chart at the base of the visualization shows all the clone classes as bars in descending order by their number of fragments in the class and their cognitive complexity. The clone bar colors represent the clone type: red for Type 1, yellow for Type 2, and green for Type 3. The height of each clone bar is proportional to its cognitive complexity. The bar chart has a slider to select and enable zooming in on a subset of clone classes.

The visualization shows a zoomed-in subset of clone classes selected with the slider (see the bar chart in Figure 3). When the developer selects or hovers over a particular clone class, the visualization displays a fisheye view of the neighboring clones and some clone information. It highlights the selected clone, its fragments, and the arcs that connect the clone class to its fragments while graying out the rest of the graph. The graph is not grayed when no class is selected, and the clone classes are sorted either by their Cognitive Complexity and then by their number of fragments or by their number of fragments and then by their Cognitive Complexity. Users can use the middle click to switch between both sort orders to make finding clones that need refactoring easier. If the developer clicks on a clone class, the highlight of its fragments remains visible until the developer clicks elsewhere. If the developer right-clicks on a clone class, the list of its fragments is shown in the fragments toolbar (see the list in Figure 2).

The fragments toolbar allows users to select two fragments and show their differences using the diff format. We reused the fragments toolbar provided by the IntelliJ IDEA Community Edition IDE in its duplicate code search utility. If the developer double-clicks a fragment from the list, the source code of that fragment opens in an editor (see Figure 2).

The developers can analyze the clone classes to determine which ones need refactoring. Depending on the development team, the maximum number of acceptable clone fragments in a clone class ranges from two to ten. We may choose five as an acceptable number of fragments in a clone class before labeling it for refactoring.

The clone classes with cognitive complexity over 15 for any of their fragments can also be labeled for refactoring, as they are considered too complex, according to the recommendation of the open-source platform SonarQube. Future research could provide
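The labeling and sorting rules described in this section can be sketched as follows. This is a hypothetical illustration rather than the plugin's code: the `CloneClass` type and the function names are invented for the example, the class-level complexity is assumed to be the maximum over its fragments, and the default limits (more than five fragments, cognitive complexity over 15) are the values suggested in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneClass:
    """A clone class reduced to the cognitive complexity of each of its fragments."""
    identifier: int
    fragment_complexities: tuple[int, ...]  # one entry per fragment, at least one

    @property
    def fragments(self) -> int:
        return len(self.fragment_complexities)

    @property
    def max_complexity(self) -> int:
        return max(self.fragment_complexities)


def needs_refactoring(clone: CloneClass, max_fragments: int = 5, max_complexity: int = 15) -> bool:
    """Label a class when it has too many fragments or any fragment is too complex."""
    return clone.fragments > max_fragments or clone.max_complexity > max_complexity


def by_complexity_then_fragments(classes: list[CloneClass]) -> list[CloneClass]:
    """First sort order: Cognitive Complexity, then number of fragments, descending."""
    return sorted(classes, key=lambda c: (c.max_complexity, c.fragments), reverse=True)


def by_fragments_then_complexity(classes: list[CloneClass]) -> list[CloneClass]:
    """Second sort order: number of fragments, then Cognitive Complexity, descending."""
    return sorted(classes, key=lambda c: (c.fragments, c.max_complexity), reverse=True)
```

Both limits are parameters because, as noted above, the acceptable number of fragments depends on the development team.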
3 Validation of clone detection algorithm

The BigCloneEval framework enables researchers to perform evaluation experiments for clone detection tools using the BigCloneBench clone benchmark (Svajlenko et al., 2014; Svajlenko and Roy, 2021). We applied this benchmark to our clone detection tool. The parameters used for the validation were: -st both -m 'CoverageMatcher 0.7' -mis 80 -mil 5 -mip 5 -mit 40. Our tool reported a high precision of 0.99993, with 60,350 true positives and four false positives. It also reported a high recall, as shown in Table 1, and its performance is similar to that of the NiCad6 tool. It took 14:03 min for our tool to detect the clones from the BigCloneBench with 55,499 files, over 15.4M LOC, and 460,138 methods over the configured thresholds. Our algorithm outperformed NiCad and other tools in speed, as shown in Table 2 (Feng et al., 2020). Feng et al. ran their experiments on an Intel Core i7-7700K, with 16GB RAM and an SSD, while we ran ours on an i7-7700HQ with 16GB RAM and an SSD, which makes the results comparable. The list of clones found in compressed files, the complete benchmark reports of both the NiCad6 and VizClone tools, and the speed logs of VizClone can be found in our tool’s GitHub repository under the bcb folder.

TABLE 2 Execution time for VizClone and other tools (Feng et al., 2020).
15.4 M - - - - 14 min 3 s

4 Results

We applied our method for visual clone analysis to the source code of the six open-source projects described in Table 3 and hosted on GitHub. The IntelliJ IDEA Community Edition and IntelliJ Platform project (JetBrains, 2024), owned by JetBrains, is an open-source IDE for several programming languages. The Elasticsearch project (Elastic, 2024), owned by Elastic, is a distributed search and analytics engine, a scalable data store, and a vector database optimized for speed and relevance on production-scale workloads. The Hadoop project (Apache Software

Project              Java files   Methods   Sim     Clones   Groups   CC > 15   Frag > 5   Time
IntelliJ community   35,136       29,962    8,354   7,801    591      129       43         6:39

Our method identified 35,136 Java files in the IntelliJ Community project and 29,962 methods with more sentences and tokens than their thresholds. The screening step yielded 8,354 clone candidate pairs, and the detection step confirmed 7,801 as true clones grouped into 591 clone classes. The number of clone classes with Cognitive Complexity over 15 is 129, while the number of clone classes with more than five fragments is 43. All of these clone classes highlight candidates for refactoring. Manual inspection is required to determine whether refactoring is truly needed. The search process took under 7 min.

The first clone class has 64 fragments, and its Cognitive Complexity is 3. All those fragments correspond to methods within the same Java class, MantisConnectBindingStub.java, and contain boilerplate code for SOAP action calls. The second clone class has 53 fragments, and its Cognitive Complexity ranges from 6 to 36. The corresponding methods implement the equals pattern. The third clone class has 47 fragments, and its Cognitive Complexity ranges from 5 to 45. The corresponding methods implement the Comparable interface for classes in the same package com.jetbrains.python.console.protocol. We inspected all the clone classes with more than five fragments, and all of them fall into categories similar to the previous ones.

The clone class with the highest Cognitive Complexity value of 117 has five fragments. Its fragments correspond to the method balanceDeletion for concurrent hash map classes inside different packages. The five fragments are almost identical. Two methods are from classes marked as deprecated: com.intellij.util.containers.ConcurrentIntObjectHashMap and com.intellij.util.containers.ConcurrentLongObjectHashMap, replaced by two others with the same names but in the package com.intellij.concurrency. The fifth fragment is from the class com.intellij.concurrency.ConcurrentHashMap in the same last package. Although these fragments are few and it appears unlikely they will change, it could be worth examining them because of their
high Cognitive Complexity value. The second clone class has three fragments, and its Cognitive Complexity ranges between 72 and 111. The corresponding methods are in the same class com.intellij.concurrency.ConcurrentHashMap and implement different flavors of remapping an existing key-value pair. The third clone class has five fragments, and its Cognitive Complexity ranges between 103 and 106. The corresponding methods come from the same previous classes ConcurrentHashMap, ConcurrentIntObjectHashMap, and ConcurrentLongObjectHashMap. The methods implement transfer to copy the nodes in each bin to a new table. We also inspected all the remaining 126 clone classes with a Cognitive Complexity above 15, and all implemented boilerplate code, pattern interfaces, lexer or parser actions, or protocol actions.

4.2 Elasticsearch project analysis

We visually analyzed the Elasticsearch project; Table 3 shows it has 10,313 Java files and 9,846 methods above the thresholds. It generated 5,996 clone pair candidates from the screening stage, and the detection stage yielded 624 clone classes and 5,140 confirmed clone pairs. There are 49 clone classes with more than five fragments and 45 with Cognitive Complexity values above 15. It took approximately 2 min to complete the search process.

We skipped the integration test suites interleaved with the source code because their methods contain numerous boilerplate source code patterns identified as clones.

The clone class with the most fragments has 111, and its Cognitive Complexity value ranges from 1 to 23. All the fragments conform to a pattern that returns a JSON builder for multiple Java classes. The second clone class has 19 fragments, and its Cognitive Complexity varies between 10 and 44. The methods configure and return a JSON query builder from a JSON content parser. The third clone class has 17 fragments; its Cognitive Complexity values range between 1 and 7. The methods are owned by the three parsing Java classes SqlBaseParser, EqlBaseParser, and PainlessParser; they return different subclasses of ParserRuleContext. Upon inspection, all the other clone classes with more than five fragments are related to parsing, interface implementation, or boilerplate code.

The three clone classes with the highest Cognitive Complexity have values of 86, 86, and 56, respectively, with each containing two fragments. The first clone class consists of the method innerFromXContent from the Java classes org.elasticsearch.client.tasks.ElasticsearchException and org.elasticsearch.ElasticsearchException; this method yields an ElasticsearchException based on an XContentParser. The second clone class involves the method parseMath found in the Java classes org.elasticsearch.common.joda.JodaDateMathParser and org.elasticsearch.common.time.JavaDateMathParser; this method parses mathematical operations on a date using a time value, a rounding flag, and a timezone, returning the outcome in milliseconds. The other clone classes with a Cognitive Complexity exceeding 15 focus on implementing JSON builders, parsing actions, pattern configurations, sorting actions, and mathematical operations.

4.3 Apache Hadoop project analysis

When applying the method to the Apache Hadoop project (see Table 3), which has 7,343 Java files and 7,563 methods above the thresholds, the screening step produced 1,037 candidate pairs, while the detection step verified 843 clone pairs grouped into 313 clone classes. The number of clone classes with Cognitive Complexity over 15 is 23, and the number of clone classes with more than five fragments is 10. All of these clone classes point out candidates for refactoring. Manual inspection is required to determine whether refactoring is truly necessary. The search process took under 2 min.

The clone class with the most fragments has 35, and its Cognitive Complexity is 2. All those fragments correspond to methods within the same Java class, FSNamesystem.java, a container for namespace state, and contain boilerplate code called by HadoopFS clients to modify and query the container. The second clone class has 24 fragments, and its Cognitive Complexity runs between 8 and 31. The corresponding methods
implement the equals pattern. The third clone class has 15 fragments, and its Cognitive Complexity ranges between 1 and 3. The corresponding methods implement HTTP requests for the TimelineReaderWerServices.java class. We inspected all the remaining clone classes with more than five fragments, and all of them fall into categories similar to the previous ones.

The clone class with the highest Cognitive Complexity value of 60 has two fragments. These fragments correspond to the methods hbMakeCodeLengths on the char array and byte array in the class org.apache.hadoop.io.compress.bzip2.CBZip2OutputStream. These methods appear very unlikely to change. The second clone class has two fragments; their Cognitive Complexity values are 50 and 41. The methods are processMapAttemptLine and processReduceAttemptLine from the org.apache.hadoop.tools.rumen.HadoopLogsAnalyzer class. The third clone class also has two fragments; their Cognitive Complexity values are 47 and 20. The corresponding methods are named convertToApplicationAttemptReport and come from the classes org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore and org.apache.hadoop.yarn.util.timeline.TimelineEntityV2Converter. The first method generates a report on the events of the received entity, while the second generates a report on the entity’s information. We inspected the remaining 23 clone classes with a Cognitive Complexity above 15, and the code implements boilerplate code, pattern interfaces, and protocol actions.

4.4 Eclipse platform UI project analysis

The analyzed revision of this project has 5,669 Java files and 5,460 methods with sufficient sentences and tokens. The screening stage delivered 862 candidate pairs for clones, and the detection stage verified 650 true clones grouped into 267 clone classes. A total of 41 clone classes have a Cognitive Complexity exceeding 15, and 11 clone classes have more than five fragments; all of these are candidates for refactoring and require manual inspection to determine whether to refactor. The search took about 1 min.

The top three clone classes with more than five fragments have 18, 9, and 8 fragments, respectively. The first clone class groups methods that determine identifiers for different UI components based on context information. Their Cognitive Complexity varies from 9 to 30, and the related classes are all within the package org.eclipse.e4.ui.model.application. The second clone class contains methods that synchronize extension points across several UI registry Java classes under the package org.eclipse.ui.internal.genericeditor, all of which have a Cognitive Complexity value of 4. The third clone class corresponds to the method eIsSet, which returns whether a corresponding feature in the UI is set for various Java classes within the same package org.eclipse.ui.internal.genericeditor; their Cognitive Complexity values range from 3 to 21. The remaining eight clone classes, each with more than five fragments, correspond to methods from classes that implement interfaces within the UI components inheritance hierarchy.

The clone class with the highest Cognitive Complexity has seven fragments with complexity values between 8 and 216. Its fragments correspond to the method doSwitch from Java classes in the package org.eclipse.e4.ui.model.application, which implement the Switch for the model’s inheritance hierarchy. The second clone class has two fragments with a Cognitive Complexity of 81. The method implemented is processChange from the Java classes org.eclipse.text.undo.DocumentUndoManager and org.eclipse.jface.text.DefaultUndoManager, which records changes in a document to be managed. The third clone class also has two fragments with a Cognitive Complexity of 50. The method implemented is refresh from the Java classes org.eclipse.ui.internal.progress.ProgressInfoItem and org.eclipse.e4.ui.progress.internal.ProgressInfoItem, which refreshes progress updates on the UI. These classes represent items used to show jobs in the UI. The remaining 38 clone classes, with Cognitive Complexity over 15, correspond to methods from classes that implement interfaces under the UI components inheritance hierarchy.

4.5 Eclipse platform project analysis

The visual analysis method examined 4,223 Java files in this project, filtered out methods with insufficient sentences or tokens, and produced 4,805 methods for inspection. From the screening step, it generated 455 candidate pairs for clones, and from the detection step, it identified 365 verified clones arranged into 217 clone classes. As shown in Table 3, it found 33 clone classes with Cognitive Complexity over 33 and only two clone classes with more than five fragments. The search process took 40 s to complete.

The clone class with the highest Cognitive Complexity is the same as the one with the most fragments; it has eight fragments with Cognitive Complexity ranging from 47 to 233. All the methods are in the class org.apache.lucene.demo.html.HTMLParserTokenManager, which is an HTML parser generated with JavaCC, a popular Java parser generator. The second clone class with six fragments corresponds to the methods named startElement, which have Cognitive Complexity values from 2 to 5. These methods belong to parsing classes in the package org.eclipse.help.internal.webapp.parser, which are specializations of the Java class ResultParser.

The second clone class, which has a high Cognitive Complexity value of 57, contains two fragments with Cognitive Complexity values of 48 and 57, respectively. The method names are findNextPrev, which belong to the Diff TreeViewer classes org.eclipse.team.internal.ui.synchronize.AbstractTreeViewerAdvisor and org.eclipse.compare.structuremergeviewer.DiffTreeViewer. The third clone class, with a Cognitive Complexity value of 45, has two fragments with Cognitive Complexity values of 29 and 45, respectively. The method name is loadContentForExtendedMemoryBlock, associated with the Java classes org.eclipse.debug.internal.ui.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Alam, A. I., Roy, P. R., Al-Omari, F., Roy, C. K., Roy, B., and Schneider, K. A. (2023). “GPTCloneBench: a comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench,” in Proceedings of the 39th IEEE International Conference on Software Maintenance and Evolution (ICSME), 688–699. doi: 10.1109/ICSME58846.2023.00013
AlOmar, E. A., Mkaouer, M. W., and Ouni, A. (2024). Behind the intent of extract method refactoring: a systematic literature review. IEEE Trans. Softw. Eng. 50, 668–694. doi: 10.1109/TSE.2023.3345800
Apache Software Foundation. (2024). Hadoop. Available online at: https://github.com/apache/hadoop/tree/07627ef19e2bf4c87f12b53e508edf8fee05856a (accessed March 18, 2024).
Assi, M., Hassan, S., and Zou, Y. (2025). Unraveling code clone dynamics in deep learning frameworks. ACM Trans. Softw. Eng. Methodol. doi: 10.1145/3721125. [Epub ahead of print].
Eclipse. (2024a). Eclipse Platform. Available online at: https://github.com/eclipse-platform/eclipse.platform/tree/e14565e5022852f2e1eebadbe12d09f4cee88378 (accessed March 18, 2024).
Eclipse. (2024b). Eclipse Platform UI. Available online at: https://github.com/eclipse-platform/eclipse.platform.ui/tree/1eead54aac4c037be1bbc08870ccf27aa870cfc8 (accessed March 18, 2024).
Elastic. (2024). Elasticsearch. Available online at: https://github.com/elastic/elasticsearch/tree/4944959acff48ffa4979c406407d3abcbb371655 (accessed March 19, 2024).
Feng, C., Wang, T., Liu, J., Zhang, Y., Xu, K., and Wang, Y. (2020). “NiCad+: speeding the detecting process of NiCad,” in 2020 IEEE International Conference on Service Oriented Systems Engineering (SOSE) (Los Alamitos, CA: IEEE Computer Society), 103–110. doi: 10.1109/SOSE49046.2020.00019
Feng, S., Suo, W., Wu, Y., Zou, D., Liu, Y., and Jin, H. (2024). “Machine learning is all you need: a simple token-based approach for effective code clone detection,” in Proceedings of the 46th International Conference on Software Engineering (ICSE) (New York, NY: Association for Computing Machinery). doi: 10.1145/3597503.3639114
Gotoh, O. (1982). An improved algorithm for matching biological sequences. J. Mol. Biol. 162, 705–708. doi: 10.1016/0022-2836(82)90398-9
Hu, T., Xu, Z., Fang, Y., Wu, Y., Yuan, B., Zou, D., et al. (2023). “Fine-grained code clone detection with block-based splitting of abstract syntax tree,” in Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 89–100. doi: 10.1145/3597926.3598040
JetBrains. (2024). IntelliJ IDEA Community Edition. Available online at: https://github.com/JetBrains/intellij-community/tree/33034c11b946c8ada97f7cef1590d40b60148e76 (accessed March 19, 2024).
Kanwal, J., Maqbool, O., Basit, H. A., Sindhu, M. A., and Inoue, K. (2022). Historical perspective of code clone refactorings in evolving software. PLoS ONE 17:e0277216. doi: 10.1371/journal.pone.0277216
Kaur, M., and Rattan, D. (2023). A systematic literature review on the use of machine learning in code clone research. Comput. Sci. Rev. 47:100528. doi: 10.1016/j.cosrev.2022.100528
Khajezade, M., Wu, J. J. W., Fard, F. H., Rodríguez-Pérez, G., and Shehata, M. S. (2024). “Investigating the efficacy of large language models for code clone detection,” in Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension (ICPC 2024, ERA Track) (New York, NY: Association for Computing Machinery). doi: 10.1145/3643916.3645030
Lei, M., Li, H., Li, J., Aundhkar, N., and Kim, D.-K. (2022). Deep learning application on code clone detection: a review of current knowledge. J. Syst. Softw. 184:111141. doi: 10.1016/j.jss.2021.111141
Leskovec, J., Rajaraman, A., and Ullman, J. D. (2020). Mining of Massive Datasets. Cambridge University Press, 3rd Edition. doi: 10.1017/9781108684163
Li, J., Tao, C., Jin, Z., Liu, F., and Li, G. (2023). “ZC3: zero-shot cross-language code clone detection,” in Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) (Echternach: IEEE Press), 400–412. doi: 10.1109/ASE56229.2023.00210
Muñoz, M., Wyrich, M., and Wagner, S. (2020). “An empirical validation of cognitive complexity as a measure of source code understandability,” in ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (New York, NY: ACM Digital Library), 1–12. doi: 10.1145/3382494.3410636
Needleman, S. B., and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453. doi: 10.1016/0022-2836(70)90057-4
Pinku, S. N., Mondal, D., and Roy, C. K. (2024). “On the use of deep learning models for semantic clone detection,” in Proceedings of the 40th IEEE International Conference on Software Maintenance and Evolution (ICSME) (Flagstaff, AZ: IEEE). doi: 10.1109/ICSME58944.2024.00053
Smith, T., and Waterman, M. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197. doi: 10.1016/0022-2836(81)90087-5
Spring. (2024). Spring Framework. Available online at: https://github.com/spring-projects/spring-framework/tree/9f7a94058a4bbc967fe47bfe6a82d88cb3feddfb
Sundelin, A., Gonzalez-Huerta, J., Torkar, R., and Wnuk, K. (2025). Governing the commons: code ownership and code-clones in large-scale software development. Empir. Softw. Eng. 30, 1–42. doi: 10.1007/s10664-024-10598-7
Svajlenko, J., Islam, J. F., Keivanloo, I., Roy, C. K., and Mia, M. M. (2014). “Towards a big data curated benchmark of inter-project code clones,” in Proceedings - 30th International Conference on Software Maintenance and Evolution, ICSME (Victoria, BC: Institute of Electrical and Electronics Engineers Inc.), 476–480. doi: 10.1109/ICSME.2014.77
Svajlenko, J., and Roy, C. K. (2021). BigCloneBench. Springer Singapore: Singapore, 93–105. doi: 10.1007/978-981-16-1927-4_7
Thaller, H., Linsbauer, L., and Egyed, A. (2022). “Semantic clone detection via probabilistic software modeling,” in Fundamental Approaches to Software Engineering (Munich: LNCS), 288–309. doi: 10.1007/978-3-030-99429-7_16
Wang, Y., Ye, Y., Wu, Y., Zhang, W., Xue, Y., and Liu, Y. (2023). “Comparison and evaluation of clone detection techniques with different code representations,” in Proc. of the 45th Int. Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 84–93. doi: 10.1109/ICSE48619.2023.00039
Xu, Z., Qiang, S., Song, D., Zhou, M., Wan, H., and Zhao, W. (2024). “DSFM: enhancing functional code clone detection with deep subtree interactions,” in Proceedings of the 46th International Conference on Software Engineering (ICSE) (New York, NY: Association for Computing Machinery). doi: 10.1145/3597503.3639215
Yahya, M. A., and Kim, D.-K. (2023). Clcd-i: cross-language clone detection by using deep learning with infercode. Computers 12:12. doi: 10.3390/computers12010012
Zakeri-Nasrabadi, M., Parsa, S., Ramezani, M., Roy, C. K., and Ekhtiarzadeh, M. (2023). A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges. J. Syst. Softw. 204:111796. doi: 10.1016/j.jss.2023.111796
Zhang, Z., and Saber, T. (2025a). Exploring the boundaries between llm code clone detection and code similarity assessment on human and ai-generated code. Big Data Cogn. Comput. 9:41. doi: 10.3390/bdcc9020041
Zhang, Z., and Saber, T. (2025b). Machine learning approaches to code similarity measurement: a systematic review. IEEE Access 13, 51729–51744. doi: 10.1109/ACCESS.2025.3553392