Semantic Clone Detection via Probabilistic Software Modeling

Thaller, Hannes; Linsbauer, Lukas; Egyed, Alexander

doi:10.1007/978-3-030-99429-7_16

Computer Science > Software Engineering

arXiv:2008.04891 (cs)

[Submitted on 11 Aug 2020 (v1), last revised 21 May 2022 (this version, v2)]

Abstract: Semantic clone detection is the process of finding program elements with similar or equal runtime behavior, for example, detecting the semantic equality between the recursive and iterative implementations of the factorial computation. Semantic clone detection is the de facto technical boundary of clone detectors. In recent years, this boundary has been tested using interesting new approaches. This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity. We present Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM) as a stable and precise solution to semantic clone detection. PSM builds a probabilistic model of a program that is capable of evaluating and generating runtime data. SCD-PSM leverages this model and its model elements to find behaviorally equal model elements. This behavioral equality is then generalized to semantic equality of the original program elements. SCD-PSM uses the likelihood between model elements as a distance metric. It then employs the likelihood ratio significance test to decide whether this distance is significant, given a pre-specified and controllable false-positive rate. The output of SCD-PSM consists of pairs of program elements (i.e., methods), their distance, and a decision on whether they are clones or not. SCD-PSM yields excellent results, with a Matthews Correlation Coefficient greater than 0.9. These results are obtained on classical semantic clone detection problems, such as detecting recursive and iterative versions of an algorithm, but also on complex problems used in coding competitions.
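
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a deliberately simple stand-in for PSM: each method's observed outputs (log-transformed) are modeled by a single Gaussian, and a likelihood ratio test compares "one shared model" against "two separate models". The helper names (`observe`, `gaussian_loglik`, `likelihood_ratio_test`), the input range, the non-clone `triangular`, and the direction of the decision rule are all assumptions made for illustration; the actual model and test formulation in the paper may differ.

```python
import math


def factorial_recursive(n):
    """Recursive factorial."""
    return 1 if n <= 1 else n * factorial_recursive(n - 1)


def factorial_iterative(n):
    """Iterative factorial: same behavior, little syntactic overlap."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


def triangular(n):
    """A non-clone used for contrast: 1 + 2 + ... + n."""
    return n * (n + 1) // 2


def observe(method, inputs):
    """Collect runtime data (hypothetical helper): log of each output."""
    return [math.log(method(n)) for n in inputs]


def gaussian_loglik(sample):
    """Log-likelihood of a sample under its maximum-likelihood Gaussian."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def likelihood_ratio_test(sample_a, sample_b, alpha=0.05):
    """Toy test: H0 = one shared Gaussian, H1 = one Gaussian per method.

    The statistic is referred to a chi-square distribution with 2 degrees
    of freedom (H1 has two extra parameters), whose survival function is
    exp(-x / 2). This asymptotic approximation is rough for small samples.
    alpha is the tolerated rate of wrongly rejecting H0.
    """
    ll_shared = gaussian_loglik(sample_a + sample_b)
    ll_separate = gaussian_loglik(sample_a) + gaussian_loglik(sample_b)
    statistic = max(0.0, 2.0 * (ll_separate - ll_shared))
    p_value = math.exp(-statistic / 2.0)
    is_clone = p_value >= alpha  # H0 not rejected: treat the pair as clones
    return statistic, p_value, is_clone


if __name__ == "__main__":
    inputs = list(range(1, 13))
    reference = observe(factorial_recursive, inputs)
    for candidate in (factorial_iterative, triangular):
        stat, p, clone = likelihood_ratio_test(reference, observe(candidate, inputs))
        print(candidate.__name__, round(stat, 3), p, clone)
```

In this sketch the likelihood ratio statistic plays the role of the distance between two methods, and `alpha` bounds how often a pair that truly shares one model is rejected. A single Gaussian over outputs is far weaker than a model of a whole program, so unrelated methods with similar output distributions would be misclassified here.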

Comments:	22 pages, 3 pages of references, 4 listings, 2 figures, 3 tables
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2008.04891 [cs.SE]
	(or arXiv:2008.04891v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2008.04891
Journal reference:	Johnsen, E.B., Wimmer, M. (eds) Fundamental Approaches to Software Engineering. FASE 2022. Lecture Notes in Computer Science, vol 13241. Springer, Cham
Related DOI:	https://doi.org/10.1007/978-3-030-99429-7_16

Submission history

From: Hannes Thaller
[v1] Tue, 11 Aug 2020 17:54:20 UTC (710 KB)
[v2] Sat, 21 May 2022 15:55:34 UTC (1,136 KB)
