MADE-WIC: Multiple Annotated Datasets for Exploring Weaknesses In Code

Moritz Mock (Free University of Bozen-Bolzano, Bozen-Bolzano, Italy, [email protected]), Jorge Melegati (Free University of Bozen-Bolzano, Bozen-Bolzano, Italy, [email protected]), Max Kretschmann (Hamburg University of Technology, Hamburg, Germany, [email protected]), Nicolás E. Díaz Ferreyra (Hamburg University of Technology, Hamburg, Germany, [email protected]), Barbara Russo (Free University of Bozen-Bolzano, Bozen-Bolzano, Italy, [email protected])
ABSTRACT
In this paper, we present MADE-WIC, a large dataset of functions and their comments with multiple annotations for technical debt and code weaknesses leveraging different state-of-the-art approaches. It contains about 860K code functions and more than 2.7M related comments from 12 open-source projects. To the best of our knowledge, no such dataset is publicly available. MADE-WIC aims to provide researchers with a curated dataset on which to test and compare tools designed for the detection of code weaknesses and technical debt. As we have fused existing datasets, researchers have the possibility to evaluate the performance of their tools by also controlling the bias related to the annotation definition and dataset construction. The demonstration video can be retrieved at https://www.youtube.com/watch?v=GaQodPrcb6E.

CCS CONCEPTS
• Software and its engineering → Maintaining software.
1 INTRODUCTION
Datasets for the detection of code weaknesses are typically annotated with heuristics derived for the available data or by static analyzers, which are not always accurate [10]. Differences in schema and annotation may prevent study replication and generalization. This is, for instance, the case of datasets annotated for vulnerability detection, for which the literature reports several issues. Vulnerability datasets often rely on human-labelled techniques (e.g., commit differential analysis) that are resource-intensive [22]. A significant number (more than 60%) of vulnerabilities miss any annotation in practice, for example, due to silent fixes (i.e., developers commit changes to fix vulnerabilities but do not label/report the commits [16]), which implies that the actual number of vulnerabilities is much higher than the reported one. Automated labelling results in a high percentage of incorrectly labelled vulnerabilities [16]. Finally, different types of annotation techniques may produce different annotated datasets and detection results [5]. Another example is the annotation of self-admitted technical debt (SATD), i.e., of comments and related code (i.e., technical debt) that have low quality and require future effort for refactoring [21]. Fig. 1 illustrates the difference in size of the same datasets annotated with the same technique by two different authors, Guo et al. [9] and Ren et al. [20]. The figure reports the number of SATD instances (y-axis) and the percentage of SATD in the respective project. For instance, there is a difference of 2.5% in SATD instances for the JRuby project. Differences in datasets' construction and annotation can produce different detection performances, as shown in Fig. 2.

Figure 1: Number of SATD instances and percentage of SATD per project, as annotated by Guo et al. [9] and by Ren et al. [20].
Figure 2: Differences in performance of the Ren et al. approach as reported by Guo et al. [9] and in the original work of Ren et al. [20] (Precision, Recall, and F1 for Guo et al. (G) and Ren et al. (R) over Apache Ant, ArgoUML, Columba, EMF, Hibernate, JEdit, JFreeChart, JMeter, JRuby, and SQuirrel).

In this paper, we apply data fusion [1] to three existing datasets (WeakSATD [21], Devign [23], and Big-Vul [6]) and build MADE-WIC, a novel curated dataset of functions and comments (including leading comments, i.e., those comments preceding and related to a function), labelled for vulnerability, technical debt, and security concerns. The dataset provides a unique schema and different annotations for the same instances and the above attributes. MADE-WIC contains about 860k functions and 2.7M comments from 12 projects. We also propose an approach for the construction and annotation compliant with the schema of MADE-WIC that enables extension to further projects. The dataset, including the code to create it, is publicly available in the replication package [15].

2 ANNOTATION APPROACHES
We implemented different techniques with which we labelled the functions of MADE-WIC either directly or through related comments, as described below.

2.1 Function Annotations
vf: We indicate with vf the original annotations of Devign and Big-Vul, as described in the following.
Devign: The Devign approach [23] leverages a list of security-related keywords in commit messages to classify commits as vulnerable, collects the previous version of the functions changed in the commits, and then manually annotates them for vulnerability.
Big-Vul: The annotation method used in Big-Vul [6] parses the information of the CVE records [14] that refer to publicly available Git repositories (e.g., GitHub) and their related bugs. From the related BugID, the fix commits and the previous versions of the changed functions are retrieved and labelled vulnerable.
W: The W annotation uses heuristics extracted from the MITRE Common Weakness Enumeration repository (CWE) [13], as proposed in the WeakSATD approach [21]. These are a set of rules that implement the descriptions provided in the CWE reports. We used the rules available in the replication package of WeakSATD [21].
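To illustrate how such rule-based function annotation operates, the following is a minimal Python sketch of a CWE-inspired check; it is not one of the actual WeakSATD rules [21], and the function name, regexes, and example are placeholders chosen for illustration. It flags a C function that calls malloc but never mentions NULL, in the spirit of CWE-252/CWE-690 (unchecked return value).

import re

# Illustrative CWE-inspired heuristic (an assumption, not an actual WeakSATD rule):
# flag a C function that allocates memory with malloc() but never mentions NULL,
# i.e., the returned pointer is apparently used without being checked.
MALLOC_CALL = re.compile(r"\bmalloc\s*\(")
NULL_MENTION = re.compile(r"\bNULL\b")

def flag_unchecked_malloc(function_code: str) -> bool:
    """Return True if the function looks like it uses malloc without a NULL check."""
    if not MALLOC_CALL.search(function_code):
        return False
    return NULL_MENTION.search(function_code) is None

example = """
char *copy(const char *s) {
    char *buf = malloc(strlen(s) + 1);
    strcpy(buf, s);              /* pointer used without a NULL check */
    return buf;
}
"""
print(flag_unchecked_malloc(example))  # True: flagged as a potential weakness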
2.2 Function Annotation through Comments
PS: The PS annotation labels functions as technical debt if they have at least one comment annotated with one of the 64 SATD patterns [19]. The patterns have been identified by manually inspecting more than 100,000 Java comments.
MAT: The MAT annotation leverages the Matches task Annotation Tags (MAT) heuristics [9] to label functions as technical debt if they have at least one comment annotated with one of the MAT tags TODO, FIXME, XXX, and HACK. MAT tags have been identified by exploring the default syntax highlighting of different IDEs.
SecI: The SecI annotation first automatically labels functions as vulnerable if they have at least one comment annotated with one of the 288 security indicators [3]. Then, functions and their annotations have been manually reviewed by three of the authors, achieving a Fleiss' Kappa [8] score of 0.735 for their agreement. The result is a set of 89 agreed security indicators.
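All three comment-based annotations follow the same scheme: a function receives a flag if at least one of its comments matches an entry of a fixed lexicon. The Python sketch below illustrates this scheme with the four MAT tags and with small, illustrative samples standing in for the 64 PS patterns [19] and the 89 agreed security indicators [3]; the sample entries and helper names are assumptions, not the published lists.

import re

# MAT tags as used by Guo et al. [9]; the PS and SecI entries below are small
# illustrative samples (assumptions), not the full published lexicons.
MAT_TAGS = ["todo", "fixme", "xxx", "hack"]
PS_SAMPLE = ["hack", "fixme", "workaround", "temporary solution", "ugly"]
SECI_SAMPLE = ["buffer overflow", "race condition", "sanitize", "vulnerab"]

def matches_lexicon(comment: str, lexicon: list[str]) -> bool:
    """True if the comment contains any lexicon entry (case-insensitive)."""
    text = comment.lower()
    return any(re.search(r"\b" + re.escape(term), text) for term in lexicon)

def annotate_through_comments(comments: list[str]) -> dict:
    """Comment-based flags for one function, mirroring the MAT/PS/SecI scheme."""
    return {
        "MAT": any(matches_lexicon(c, MAT_TAGS) for c in comments),
        "PS": any(matches_lexicon(c, PS_SAMPLE) for c in comments),
        "SecI": any(matches_lexicon(c, SECI_SAMPLE) for c in comments),
    }

print(annotate_through_comments(["// TODO: replace this ugly workaround"]))
# {'MAT': True, 'PS': True, 'SecI': False}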
3 DATASET FUSION
MADE-WIC fuses and extends two existing datasets, Devign [23] and Big-Vul [6], by creating a unique schema that transforms the data present in the sources into a common representation. The selected datasets and projects use C/C++, provide functions and comments or references to the GitHub projects from which to extract them, and make their annotation approach publicly available. The resulting dataset consists of functions and their comments (internal and leading comments). Fig. 3 illustrates the overall fusion and annotation approach, as described in the following sections.
Figure 3: Fusion approach to generate MADE-WIC, extracting the information from existing datasets and open source projects (pipeline: functions with leading comments are extracted from the open-source projects Chromium, Linux Kernel, and Mozilla Firefox, and features and leading comments are extracted from Devign and Big-Vul; duplicates are removed; functions are annotated with rules and comments with patterns; the annotated functions and comments, with Project-name, Commit-ID, Filepath, Function, and Leading-Comment, are stored as the MADE-WIC CSV files).
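A minimal sketch of the record this pipeline produces is shown below, assuming the column names of the schema in Table 1 (Section 3.2); the file name and helper function are illustrative only.

import csv

# Assumed column order following the schema of Table 1.
FIELDS = ["Project-name", "Commit-ID", "Filepath", "Function", "Leading-Comment",
          "PS", "MAT", "Big-Vul", "Devign", "W", "SecI"]

def make_row(project, commit, path, function, leading_comment, flags):
    """Assemble one MADE-WIC record; `flags` holds the boolean annotations."""
    row = {"Project-name": project, "Commit-ID": commit, "Filepath": path,
           "Function": function, "Leading-Comment": leading_comment}
    row.update({k: flags.get(k, False)
                for k in ["PS", "MAT", "Big-Vul", "Devign", "W", "SecI"]})
    return row

with open("ospr.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(make_row("Chromium", "57f97b2", "src/foo.c",
                             "int f(void) { return 0; }",
                             "/* TODO: handle errors */",
                             {"MAT": True, "PS": False}))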
3.1 Data Extraction
To extract the data, we implemented and automated two processes: 1) from the annotated datasets Devign and BigVul and 2) from the open-source repositories, starting from the Chromium project used in WeakSATD and extending to the Linux Kernel and Mozilla Firefox.
From annotated datasets. From Devign and BigVul, we extracted the features (Project Name, Commit-ID, File Name, Annotation, Function) and used the commit-ID to get the project version. We then used srcML [2] to obtain any leading comment of the functions. Given that retrieving the leading comments is time-consuming and Big-Vul is a very large dataset, we chose the ten projects with the largest number of vulnerabilities, accounting for 75% of the total. The projects are listed in Table 2. From the table, we can also see that three of the projects (Linux, Chrome, and FFmpeg) are shared between Devign and BigVul, but the different extraction and annotation techniques make them different in size and composition.
From open source public repositories. From the open-source projects (Chromium, Linux Kernel, Mozilla Firefox), we cloned the repositories at the last commit (hash 57f97b2 for Chromium, e2ca6ba for Linux Kernel, and 4d46db3ff28b for Mozilla Firefox) and extracted with srcML the features (FileName, Annotation, Function, and leading comment, if any).
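The extraction of a function together with its leading comment can be sketched as follows. This is a simplified sketch, not the actual script from the replication package [15]: it assumes the srcml command-line tool is installed and that the leading comment, when present, is the XML sibling immediately preceding the <function> element.

import subprocess
from lxml import etree

SRC = "http://www.srcML.org/srcML/src"  # srcML namespace for source-code elements

def functions_with_leading_comments(c_file: str):
    """Yield (function_source, leading_comment_or_None) pairs for one C file."""
    # srcML turns the source file into an XML document that marks up its syntax.
    xml = subprocess.run(["srcml", c_file], capture_output=True, check=True).stdout
    root = etree.fromstring(xml)
    for fn in root.iter(f"{{{SRC}}}function"):
        prev = fn.getprevious()
        comment = None
        if prev is not None and isinstance(prev.tag, str) and prev.tag == f"{{{SRC}}}comment":
            comment = "".join(prev.itertext())
        yield "".join(fn.itertext()), comment

for code, comment in functions_with_leading_comments("example.c"):
    print(comment, code[:40])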
Finally, in both processes, we removed duplicates using PMD-CPD [18], which tokenizes the functions and calculates their similarity based on a threshold of 30 tokens. The threshold balances computational effort with the thoroughness of the inspection and includes 85% of all functions. Then, we annotated the functions in multiple ways, either directly or through the comments, as described in Section 2. The annotated functions and their comments are then stored in the replication package in three CSV files called OSPR, BigVul, and Devign.
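Duplicate detection with PMD-CPD and a 30-token threshold can be scripted as in the sketch below; the command-line flags follow the PMD 7 documentation and may need to be adapted to the installed PMD version, so treat the exact invocation as an assumption.

import csv
import io
import subprocess

# Run CPD over the extracted sources with the 30-token threshold used above.
# Flag names follow the PMD 7 CLI (assumption; older versions use a wrapper script).
result = subprocess.run(
    ["pmd", "cpd", "--minimum-tokens", "30", "--language", "cpp",
     "--dir", "extracted_functions/", "--format", "csv"],
    capture_output=True, text=True)

# Each CSV row describes one clone group: lines, tokens, number of occurrences,
# followed by the locations of the duplicated fragments.
for row in csv.reader(io.StringIO(result.stdout)):
    if row:
        print(row)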
3.2 Schema
Table 1 illustrates the schema resulting from the fusion: each row describes a column of a CSV file of MADE-WIC. The schema contains eleven attributes. The annotation flags indicate the type of annotation as described in Section 2. The annotations Big-Vul and Devign are extracted from the original datasets. The attributes Project-name, Commit-ID, and Filepath can be used to verify or extend MADE-WIC in the future. Table 2 describes the three CSV files compounding MADE-WIC. For each file, we report: (fn) the number of functions, (vf) the number of vulnerable functions as in the original dataset, (W) the number of vulnerable functions annotated as weaknesses, (PS) and (MAT) the number of functions annotated as technical debt, and (SecI) the number of functions annotated with security indicators.

Table 1: MADE-WIC schema.

Name             Type     Description
Project-name     string   Project from which the function was extracted
Commit-ID        string   Commit hash from which the function was extracted
Filepath         string   Path of the file from which the function was extracted
Function         string   Source code of the function
Leading-Comment  string   Comment(s) preceding the function
PS               boolean  Flag indicating PS annotation
MAT              boolean  Flag indicating MAT annotation
Big-Vul          boolean  Flag indicating Big-Vul annotation
Devign           boolean  Flag indicating Devign annotation
W                boolean  Flag indicating W annotation
SecI             boolean  Flag indicating SecI annotation
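Assuming the column names above, a consumer of MADE-WIC can load one of the CSV files and select functions by annotation as in this short sketch (the file name is illustrative):

import pandas as pd

# Load one of the three CSV files (OSPR, BigVul, or Devign); the file name is
# illustrative and the columns follow the schema of Table 1.
df = pd.read_csv("OSPR.csv")

# Functions flagged as weaknesses by the CWE-derived W heuristics.
weak = df[df["W"]]

# Functions flagged as technical debt by both comment-based annotations.
satd = df[df["PS"] & df["MAT"]]

print(len(weak), len(satd))
print(weak[["Project-name", "Commit-ID", "Filepath"]].head())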
3.3 Application Scenarios
Our dataset can be used for various classification tasks that involve functions and/or their comments. For instance, it can be used to fine-tune pre-trained deep learning models for text and code (e.g., BERT-based transformers such as CodeBERT [7]) for downstream tasks that classify functions as technical debt and/or vulnerable, by simply exploiting MADE-WIC with the PS annotation for technical debt and W for weakness. Studies can also understand the impact of different annotation techniques on the same data by, for instance, comparing MADE-WIC on vf and W, or the different subsets of MADE-WIC, Devign and BigVul, with their original vf annotations. MADE-WIC can also be used for the summarization of specific comments (e.g., generating SATD or security-related comments from functions) and for masking tasks on functions and/or comments. For instance, researchers can mask PS patterns in comments or W patterns in functions and compare the ability of different transformers to retrieve the patterns.
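As a sketch of the first scenario, the snippet below fine-tunes CodeBERT [7] as a binary classifier of functions, using the W column as the label; the file name, column names, and hyper-parameters are assumptions for illustration, not the settings of any experiment reported here.

import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load MADE-WIC functions and use the W annotation as a binary label
# (column and file names assumed from Table 1).
df = pd.read_csv("OSPR.csv")[["Function", "W"]].rename(
    columns={"Function": "text", "W": "label"})
df["label"] = df["label"].astype(int)
ds = Dataset.from_pandas(df).train_test_split(test_size=0.2)

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=256)

ds = ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="w-classifier", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)
trainer.train()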
3.4 Data Quality
In this section, we leverage the work of Croft et al. [4] to assess the quality of MADE-WIC. The paper provides a set of attributes for high-quality software vulnerability datasets.
Accuracy - The degree to which the data has attributes that correctly represent the true value of the intended attribute of a concept or event. To ensure accuracy, we first decided to use the state-of-the-art annotations. Then, we manually reviewed 31 functions annotated as vulnerable by W (one function per heuristic of the W annotation) and their comments from the OSPR portion. Each of the 31 functions was independently verified by three authors. In the end, 6 heuristics have been tuned and 10 removed. The authors then agreed on 21 rules and the corresponding vulnerable functions.
Uniqueness - The degree to which there is no duplication in records. To remove duplicated functions, we used PMD-CPD [18] and a 99% threshold for the Jaccard index of overlapping function code. This process was carried out for each of the datasets, both for individual projects and between different projects. In the end, we removed 127 duplicates.
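The Jaccard criterion used for uniqueness can be written down directly; the following is a minimal sketch of the criterion (token-set Jaccard index with a 0.99 threshold), not the exact PMD-CPD configuration, and the tokenizer is a deliberately crude placeholder.

import re

def tokens(code: str) -> set[str]:
    """Crude tokenizer: identifiers, numbers, and punctuation as a set."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def is_duplicate(a: str, b: str, threshold: float = 0.99) -> bool:
    """Two functions are treated as duplicates above a 99% Jaccard overlap."""
    return jaccard(a, b) >= threshold

print(is_duplicate("int f(int x){return x+1;}", "int f(int x){return x+1;}"))  # True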
Table 2: Functions (including their comments) in MADE-WIC: total (fn), vulnerability annotated as in the original datasets (vf), vulnerability annotated as weaknesses (W), technical debt annotated as (PS) and (MAT), and security annotated as (SecI).

Project        fn        vf      W        PS     MAT     SecI
OSPR, language: C
Chromium       20,028    -       5,205    151    508     514
Linux Ker.     652,726   -       210,220  5,594  8,444   13,182
Mozilla Fir.   15,380    -       4,200    116    436     141
Total          688,134   -       219,625  5,861  9,388   13,837
Devign, language: C
FFmpeg         9,738     4,961   7,321    1,032  1,366   732
Qemu           17,544    7,476   7,785    477    1,378   653
Total          27,282    12,437  15,106   1,509  2,744   1,385
Big-Vul, language: C, C++
Android        8,671     1,267   3,598    53     184     316
Chromium       77,167    3,938   10,434   165    569     334
FFmpeg         1,925     114     1,201    70     99      79
File(1) comm.  294       49      207      12     7       0
ImageMagick    2,489     338     1,271    12     11      91
Kerberos5      832       140     478      15     31      187
Linux Ker.     46,828    1,955   18,298   768    886     2,652
PHP Interp.    2,669     364     1,580    59     88      173
Radare2        1,168     73      722      19     67      19
Tcpdump        778       210     532      64     110     77
Total          144,358   8,448   38,214   1,237  2,042   3,928
Grand Total    859,774   20,885  272,945  8,607  14,174  19,150

Consistency - The degree to which data instances have attributes that are free from contradiction and are coherent with other instances. As there is no oracle for the attributes we considered in this work, the different annotation techniques for one attribute (e.g., vulnerability) report a portion of functions with positive or negative annotation depending on the technique. For instance, the percentage of functions that are consistently annotated by vf and W ranges between 8% and 45%, while for technical debt (PS and MAT annotation), it ranges between 0% and 68% over individual projects.
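These consistency figures can be recomputed from the released CSV files. The sketch below shows one way to measure the per-project overlap of two boolean annotations (here Big-Vul, i.e., the vf annotation of that subset, against W, and PS against MAT); whether this matches the exact metric used above is an assumption, and the file name is illustrative.

import pandas as pd

df = pd.read_csv("BigVul.csv")  # illustrative file name following the schema of Table 1

def positive_agreement(df: pd.DataFrame, a: str, b: str) -> pd.Series:
    """Per project: functions flagged by both annotations over functions flagged by either."""
    both = (df[a] & df[b]).groupby(df["Project-name"]).sum()
    either = (df[a] | df[b]).groupby(df["Project-name"]).sum()
    return (both / either).fillna(0)

print(positive_agreement(df, "Big-Vul", "W"))   # vulnerability annotations
print(positive_agreement(df, "PS", "MAT"))      # technical-debt annotations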
Completeness - The degree to which subject data associated with an entity has values for all expected attributes and related instances. We provided multiple annotations for all attributes of interest (vulnerability, technical debt, security concerns). Some of the OSPR projects miss the Big-Vul or Devign original annotations, and filling them will be a matter of future work.
Currentness - The degree to which data has attributes that are of the right age. The OSPR data was extracted and annotated in 2023, with the exception of the Chromium project, whose functions were extracted from the WeakSATD repository [21]. The Devign and Big-Vul datasets have been extracted as-is from the original sources [6, 23].

4 RELATED WORK AND CONCLUSIONS
To the best of our knowledge, MADE-WIC is the first dataset that provides functions, all relevant comments, and their annotations for technical debt, vulnerability, and security concerns. The three datasets we fused in this study (OSPR, Devign, Big-Vul) have single annotations or different annotation methods. For instance, WeakSATD [21] annotates files of the Chromium project for SATD with PS. In our work, we have reviewed this approach and extracted the Chromium functions and comments from the original source. Not all the datasets in the literature could be fused with our approach. For instance, the dataset of Lin et al. [11], which can be found in the replication package by Nong et al. [17], does not provide the commit-IDs from which the functions were extracted, preventing researchers from retrieving any further data than the ones they published, e.g., leading comments. D2A [22] is a dataset that leverages six open-source projects at function-level granularity. The annotation was done using a tool-based approach, focusing on the function versions before and after a fixing commit. Due to its large size of 3.7 GB, which includes more than 1.2 million instances, integrating this extensive dataset will be addressed in future work.

Acknowledgments. Moritz Mock is partially funded by the National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR - DM 117/2023). The work has been funded by project no. EFRE1039 under the 2023 EFRE/FESR program.

REFERENCES
[1] Jens Bleiholder and Felix Naumann. 2009. Data fusion. ACM Comput. Surv. 41, 1, Article 1 (Jan. 2009), 41 pages.
[2] Michael L. Collard, Michael John Decker, and Jonathan I. Maletic. 2013. srcML: An Infrastructure for the Exploration, Analysis, and Manipulation of Source Code: A Tool Demonstration (ICSME). 516–519.
[3] Roland Croft et al. 2022. An empirical study of developers' discussions about security challenges of different programming languages. Empirical Software Engineering 27 (2022), 1–52.
[4] Roland Croft, M. Ali Babar, and M. Mehdi Kholoosi. 2023. Data Quality for Software Vulnerability Datasets (ICSE). 121–133.
[5] Roland Croft, Yongzheng Xie, and Muhammad Ali Babar. 2022. Data preparation for software vulnerability prediction: A systematic literature review. IEEE Transactions on Software Engineering 49, 3 (2022), 1044–1063.
[6] Jiahao Fan, Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries (MSR '20). Association for Computing Machinery, New York, NY, USA, 508–512.
[7] Zhangyin Feng et al. 2020. CodeBERT: A pre-trained model for programming and natural languages. EMNLP 2020 (2020), 1536–1547.
[8] Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76, 5 (1971), 378.
[9] Zhaoqiang Guo et al. 2021. How Far Have We Progressed in Identifying Self-Admitted Technical Debts? A Comprehensive Empirical Study.
[10] Kim Herzig and Andreas Zeller. 2013. The impact of tangled code changes (MSR). IEEE, 121–130.
[11] Guanjun Lin, Wei Xiao, Jun Zhang, and Yang Xiang. 2020. Deep Learning-Based Vulnerable Function Detection: A Benchmark. In Information and Communications Security, Jianying Zhou, Xiapu Luo, Qingni Shen, and Zhen Xu (Eds.). Springer International Publishing, Cham, 219–232.
[12] Everton da Silva Maldonado, Emad Shihab, and Nikolaos Tsantalis. 2017. Using Natural Language Processing to Automatically Detect Self-Admitted Technical Debt. IEEE Transactions on Software Engineering 43, 11 (2017), 1044–1062.
[13] MITRE. 2006. Common Weakness Enumeration. https://cwe.mitre.org
[14] MITRE. 1999. Common Vulnerabilities and Exposures. https://cve.mitre.org
[15] Moritz Mock, Jorge Melegati, et al. 2024. Replication package. https://doi.org/10.5281/zenodo.12567874
[16] Giang Nguyen-Truong et al. 2022. HERMES: Using Commit-Issue Linking to Detect Vulnerability-Fixing Commits (SANER). 51–62.
[17] Yu Nong et al. 2023. Open Science in Software Engineering: A Study on Deep Learning-Based Vulnerability Detection. IEEE Transactions on Software Engineering 49, 4 (2023), 1983–2005.
[18] PMD. 2024. PMD-CPD. https://pmd.github.io/pmd/pmd_userdocs_cpd.html
[19] Aniket Potdar and Emad Shihab. 2014. An Exploratory Study on Self-Admitted Technical Debt (ICSME). 91–100.
[20] Xiaoxue Ren et al. 2019. Neural Network-Based Detection of Self-Admitted Technical Debt: From Performance to Explainability. ACM Trans. Softw. Eng. Methodol. 28, 3, Article 15 (July 2019), 45 pages.
[21] Barbara Russo, Matteo Camilli, and Moritz Mock. 2022. WeakSATD: Detecting Weak Self-admitted Technical Debt (MSR). 448–453.
[22] Yunhui Zheng et al. 2021. D2A: A dataset built for AI-based vulnerability detection methods using differential analysis (ICSE-SEIP). IEEE, 111–120.
[23] Yaqin Zhou et al. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. Curran Associates.