MADE-WIC: Multiple Annotated Datasets for Exploring Weaknesses In Code

Moritz Mock (Free University of Bozen-Bolzano, Bozen-Bolzano, Italy, [email protected]), Jorge Melegati (Free University of Bozen-Bolzano, Bozen-Bolzano, Italy, [email protected]), Max Kretschmann (Hamburg University of Technology, Hamburg, Germany, [email protected]), Nicolás E. Díaz Ferreyra (Hamburg University of Technology, Hamburg, Germany, [email protected]), Barbara Russo (Free University of Bozen-Bolzano, Bozen-Bolzano, Italy, [email protected])
ABSTRACT
In this paper, we present MADE-WIC, a large dataset of functions and their comments with multiple annotations for technical debt and code weaknesses leveraging different state-of-the-art approaches. It contains about 860K code functions and more than 2.7M related comments from 12 open-source projects. To the best of our knowledge, no such dataset is publicly available. MADE-WIC aims to provide researchers with a curated dataset on which to test and compare tools designed for the detection of code weaknesses and technical debt. As we have fused existing datasets, researchers have the possibility to evaluate the performance of their tools by also controlling the bias related to the annotation definition and dataset construction. The demonstration video can be retrieved at https://www.youtube.com/watch?v=GaQodPrcb6E.

CCS CONCEPTS
• Software and its engineering → Maintaining software.
1 INTRODUCTION
Datasets for the detection of code weaknesses are typically annotated with heuristics derived for the available data or by static analyzers, which are not always accurate [10]. Differences in schema and annotation may prevent study replication and generalization. This is, for instance, the case of datasets annotated for vulnerability detection, for which the literature reports several issues. Vulnerability datasets often rely on human-labelled techniques (e.g., commit differential analysis) that are resource-intensive [22]. A significant number (more than 60%) of vulnerabilities miss any annotation in practice, for example, due to silent fixes (i.e., developers commit changes to fix vulnerabilities but do not label/report the commits [16]), which implies that the actual number of vulnerabilities is much higher than the reported one. Automated labelling results in a high percentage of incorrectly labelled vulnerabilities [16]. Finally, different types of annotation techniques may produce different annotated datasets and detection results [5]. Another example is the annotation of self-admitted technical debt (SATD), i.e., of comments and related code (i.e., technical debt) that have low quality and require future effort for refactoring [21]. Fig. 1 illustrates the difference in size of the same datasets annotated with the same technique by two different authors, Guo et al. [9] and Ren et al. [20]. The figure reports the number of SATD instances (y-axis) and the percentage of SATD in the respective project. For instance, there is a difference of 2.5% in SATD instances for the JRuby project. Differences in datasets' construction and annotation can produce different detection performances, as shown in Fig. 2.

Figure 1: Number of SATD instances and percentage of SATD per project, as annotated by Guo et al. [9] and by Ren et al. [20].
Figure 2: Differences in performance of the Ren et al. approach as reported by Guo et al. [9] and in the original work of Ren et al. [20] (Precision, Recall, and F1 for Guo et al. (G) and Ren et al. (R) over Apache Ant, ArgoUML, Columba, EMF, Hibernate, JEdit, JFreeChart, JMeter, JRuby, and SQuirrel).

In this paper, we apply data fusion [1] to three existing datasets (WeakSATD [21], Devign [23], and Big-Vul [6]) and build MADE-WIC, a novel curated dataset of functions and comments (including leading comments, i.e., those comments preceding and related to a function), labelled for vulnerability, technical debt, and security concerns. The dataset provides a unique schema and different annotations for the same instances and the above attributes. MADE-WIC contains about 860k functions and 2.7M comments from 12 projects. We also propose an approach for the construction and annotation compliant with the schema of MADE-WIC that enables extension to further projects. The dataset, including the code to create it, is publicly available in the replication package [15].

2 ANNOTATION APPROACHES
We implemented different techniques with which we labelled the functions of MADE-WIC either directly or through related comments, as described below.

2.1 Function Annotations
vf: We indicate with vf the original annotations of Devign and Big-Vul, as described in the following.
Devign: The Devign approach [23] leverages a list of security-related keywords in commit messages to classify commits as vulnerable, collects the previous version of the functions changed in the commits, and then manually annotates them for vulnerability.
Big-Vul: The annotation method used in Big-Vul [6] parses the information of the CVE records [14] that refer to publicly available Git repositories (e.g., GitHub) and their related bugs. From the related BugID, the fix commits and the previous versions of the changed functions are retrieved and labelled vulnerable.
W: The W annotation uses heuristics extracted from the MITRE Common Weakness Enumeration repository (CWE) [13], as proposed in the WeakSATD approach [21]. These are a set of rules that implement the descriptions provided in the CWE reports. We used the rules available in the replication package of WeakSATD [21].
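To illustrate how such rule-based function annotation operates, the following is a minimal Python sketch of a CWE-inspired check; it is not one of the actual WeakSATD rules [21], and the function name, regexes, and example are placeholders chosen for illustration. It flags a C function that calls malloc but never mentions NULL, in the spirit of CWE-252/CWE-690 (unchecked return value).

import re

# Illustrative CWE-inspired heuristic (an assumption, not an actual WeakSATD rule):
# flag a C function that allocates memory with malloc() but never mentions NULL,
# i.e., the returned pointer is apparently used without being checked.
MALLOC_CALL = re.compile(r"\bmalloc\s*\(")
NULL_MENTION = re.compile(r"\bNULL\b")

def flag_unchecked_malloc(function_code: str) -> bool:
    """Return True if the function looks like it uses malloc without a NULL check."""
    if not MALLOC_CALL.search(function_code):
        return False
    return NULL_MENTION.search(function_code) is None

example = """
char *copy(const char *s) {
    char *buf = malloc(strlen(s) + 1);
    strcpy(buf, s);              /* pointer used without a NULL check */
    return buf;
}
"""
print(flag_unchecked_malloc(example))  # True: flagged as a potential weakness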
2.2 Function Annotation through Comments
PS: The PS annotation labels functions as technical debt if they have at least one comment annotated with one of the 64 SATD patterns [19]. The patterns have been identified by manually inspecting more than 100,000 Java comments.
MAT: The MAT annotation leverages the Matches task Annotation Tags (MAT) heuristics [9] to label functions as technical debt if they have at least one comment annotated with one of the MAT tags TODO, FIXME, XXX, and HACK. MAT tags have been identified by exploring the default syntax highlighting of different IDEs.
SecI: The SecI annotation first automatically labels functions as vulnerable if they have at least one comment annotated with one of the 288 security indicators [3]. Then, functions and their annotations have been manually reviewed by three of the authors, achieving a Fleiss' Kappa [8] score of 0.735 for their agreement. The result is a set of 89 agreed security indicators.
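All three comment-based annotations follow the same scheme: a function receives a flag if at least one of its comments matches an entry of a fixed lexicon. The Python sketch below illustrates this scheme with the four MAT tags and with small, illustrative samples standing in for the 64 PS patterns [19] and the 89 agreed security indicators [3]; the sample entries and helper names are assumptions, not the published lists.

import re

# MAT tags as used by Guo et al. [9]; the PS and SecI entries below are small
# illustrative samples (assumptions), not the full published lexicons.
MAT_TAGS = ["todo", "fixme", "xxx", "hack"]
PS_SAMPLE = ["hack", "fixme", "workaround", "temporary solution", "ugly"]
SECI_SAMPLE = ["buffer overflow", "race condition", "sanitize", "vulnerab"]

def matches_lexicon(comment: str, lexicon: list[str]) -> bool:
    """True if the comment contains any lexicon entry (case-insensitive)."""
    text = comment.lower()
    return any(re.search(r"\b" + re.escape(term), text) for term in lexicon)

def annotate_through_comments(comments: list[str]) -> dict:
    """Comment-based flags for one function, mirroring the MAT/PS/SecI scheme."""
    return {
        "MAT": any(matches_lexicon(c, MAT_TAGS) for c in comments),
        "PS": any(matches_lexicon(c, PS_SAMPLE) for c in comments),
        "SecI": any(matches_lexicon(c, SECI_SAMPLE) for c in comments),
    }

print(annotate_through_comments(["// TODO: replace this ugly workaround"]))
# {'MAT': True, 'PS': True, 'SecI': False}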
3 DATASET FUSION
MADE-WIC fuses and extends two existing datasets, Devign [23] and Big-Vul [6], by creating a unique schema that transforms the data present in the sources into a common representation. The selected datasets and projects use C/C++, provide functions and comments or references to the GitHub projects from which to extract them, and make their annotation approach publicly available. The resulting dataset consists of functions and their comments (internal and leading comments). Fig. 3 illustrates the overall fusion and annotation approach, as described in the following sections.
Figure 3: Fusion approach to generate MADE-WIC, extracting the information from existing datasets and open source projects (pipeline: functions with leading comments are extracted from the open-source projects Chromium, Linux Kernel, and Mozilla Firefox, and features and leading comments are extracted from Devign and Big-Vul; duplicates are removed; functions are annotated with rules and comments with patterns; the annotated functions and comments, with Project-name, Commit-ID, Filepath, Function, and Leading-Comment, are stored as the MADE-WIC CSV files).
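A minimal sketch of the record this pipeline produces is shown below, assuming the column names of the schema in Table 1 (Section 3.2); the file name and helper function are illustrative only.

import csv

# Assumed column order following the schema of Table 1.
FIELDS = ["Project-name", "Commit-ID", "Filepath", "Function", "Leading-Comment",
          "PS", "MAT", "Big-Vul", "Devign", "W", "SecI"]

def make_row(project, commit, path, function, leading_comment, flags):
    """Assemble one MADE-WIC record; `flags` holds the boolean annotations."""
    row = {"Project-name": project, "Commit-ID": commit, "Filepath": path,
           "Function": function, "Leading-Comment": leading_comment}
    row.update({k: flags.get(k, False)
                for k in ["PS", "MAT", "Big-Vul", "Devign", "W", "SecI"]})
    return row

with open("ospr.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(make_row("Chromium", "57f97b2", "src/foo.c",
                             "int f(void) { return 0; }",
                             "/* TODO: handle errors */",
                             {"MAT": True, "PS": False}))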
3.1 Data Extraction
To extract the data, we implemented and automated two processes: 1) from the annotated datasets Devign and BigVul and 2) from the open-source repositories, starting from the Chromium project used in WeakSATD and extending to the Linux Kernel and Mozilla Firefox.
From annotated datasets. From Devign and BigVul, we extracted the features (Project Name, Commit-ID, File Name, Annotation, Function) and used the commit-ID to get the project version. We then used srcML [2] to obtain any leading comment of the functions. Given that retrieving the leading comments is time-consuming and Big-Vul is a very large dataset, we chose the ten projects with the largest number of vulnerabilities, accounting for 75% of the total. The projects are listed in Table 2. From the table, we can also see that three of the projects (Linux, Chrome, and FFmpeg) are shared between Devign and BigVul, but the different extraction and annotation techniques make them different in size and composition.
From open source public repositories. From the open-source projects (Chromium, Linux Kernel, Mozilla Firefox), we cloned the repositories at the last commit (hash 57f97b2 for Chromium, e2ca6ba for Linux Kernel, and 4d46db3ff28b for Mozilla Firefox) and extracted with srcML the features (FileName, Annotation, Function, and leading comment, if any).
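The extraction of a function together with its leading comment can be sketched as follows. This is a simplified sketch, not the actual script from the replication package [15]: it assumes the srcml command-line tool is installed and that the leading comment, when present, is the XML sibling immediately preceding the <function> element.

import subprocess
from lxml import etree

SRC = "http://www.srcML.org/srcML/src"  # srcML namespace for source-code elements

def functions_with_leading_comments(c_file: str):
    """Yield (function_source, leading_comment_or_None) pairs for one C file."""
    # srcML turns the source file into an XML document that marks up its syntax.
    xml = subprocess.run(["srcml", c_file], capture_output=True, check=True).stdout
    root = etree.fromstring(xml)
    for fn in root.iter(f"{{{SRC}}}function"):
        prev = fn.getprevious()
        comment = None
        if prev is not None and isinstance(prev.tag, str) and prev.tag == f"{{{SRC}}}comment":
            comment = "".join(prev.itertext())
        yield "".join(fn.itertext()), comment

for code, comment in functions_with_leading_comments("example.c"):
    print(comment, code[:40])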
Finally, in both processes, we removed duplicates using PMD-CPD [18], which tokenizes the functions and calculates their similarity based on a threshold of 30 tokens. The threshold balances computational effort with the thoroughness of the inspection and includes 85% of all functions. Then, we annotated the functions in multiple ways, either directly or through the comments, as described in Section 2. The annotated functions and their comments are then stored in the replication package in three CSV files called OSPR, BigVul, and Devign.
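Duplicate detection with PMD-CPD and a 30-token threshold can be scripted as in the sketch below; the command-line flags follow the PMD 7 documentation and may need to be adapted to the installed PMD version, so treat the exact invocation as an assumption.

import csv
import io
import subprocess

# Run CPD over the extracted sources with the 30-token threshold used above.
# Flag names follow the PMD 7 CLI (assumption; older versions use a wrapper script).
result = subprocess.run(
    ["pmd", "cpd", "--minimum-tokens", "30", "--language", "cpp",
     "--dir", "extracted_functions/", "--format", "csv"],
    capture_output=True, text=True)

# Each CSV row describes one clone group: lines, tokens, number of occurrences,
# followed by the locations of the duplicated fragments.
for row in csv.reader(io.StringIO(result.stdout)):
    if row:
        print(row)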
3.2 Schema
Table 1 illustrates the schema resulting from the fusion: each row describes a column of a CSV file of MADE-WIC. The schema contains eleven attributes. The annotation flags indicate the type of annotation as described in Section 2. The annotations Big-Vul and Devign are extracted from the original datasets. The attributes Project-name, Commit-ID, and Filepath can be used to verify or extend MADE-WIC in the future. Table 2 describes the three CSV files compounding MADE-WIC. For each file, we report: (fn) the number of functions, (vf) the number of vulnerable functions as in the original dataset, (W) the number of vulnerable functions annotated as weaknesses, (PS) and (MAT) the number of functions annotated as technical debt, and (SecI) the number of functions annotated with security indicators.

Table 1: MADE-WIC schema.

Name             Type     Description
Project-name     string   Project from which the function was extracted
Commit-ID        string   Commit hash from which the function was extracted
Filepath         string   Path of the file from which the function was extracted
Function         string   Source code of the function
Leading-Comment  string   Comment(s) preceding the function
PS               boolean  Flag indicating PS annotation
MAT              boolean  Flag indicating MAT annotation
Big-Vul          boolean  Flag indicating Big-Vul annotation
Devign           boolean  Flag indicating Devign annotation
W                boolean  Flag indicating W annotation
SecI             boolean  Flag indicating SecI annotation
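Assuming the column names above, a consumer of MADE-WIC can load one of the CSV files and select functions by annotation as in this short sketch (the file name is illustrative):

import pandas as pd

# Load one of the three CSV files (OSPR, BigVul, or Devign); the file name is
# illustrative and the columns follow the schema of Table 1.
df = pd.read_csv("OSPR.csv")

# Functions flagged as weaknesses by the CWE-derived W heuristics.
weak = df[df["W"]]

# Functions flagged as technical debt by both comment-based annotations.
satd = df[df["PS"] & df["MAT"]]

print(len(weak), len(satd))
print(weak[["Project-name", "Commit-ID", "Filepath"]].head())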
3.3 Application Scenarios
Our dataset can be used for various classification tasks that involve functions and/or their comments. For instance, it can be used to fine-tune pre-trained deep learning models for text and code (e.g., BERT-based transformers such as CodeBERT [7]) for downstream tasks that classify functions as technical debt and/or vulnerable, by simply exploiting MADE-WIC with the PS annotation for technical debt and W for weakness. Studies can also understand the impact of different annotation techniques on the same data by, for instance, comparing MADE-WIC on vf and W, or the different subsets of MADE-WIC, Devign and BigVul, with their original vf annotations. MADE-WIC can also be used for the summarization of specific comments (e.g., generating SATD or security-related comments from functions) and for masking tasks on functions and/or comments. For instance, researchers can mask PS patterns in comments or W patterns in functions and compare the ability of different transformers to retrieve the patterns.
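As a sketch of the first scenario, the snippet below fine-tunes CodeBERT [7] as a binary classifier of functions, using the W column as the label; the file name, column names, and hyper-parameters are assumptions for illustration, not the settings of any experiment reported here.

import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load MADE-WIC functions and use the W annotation as a binary label
# (column and file names assumed from Table 1).
df = pd.read_csv("OSPR.csv")[["Function", "W"]].rename(
    columns={"Function": "text", "W": "label"})
df["label"] = df["label"].astype(int)
ds = Dataset.from_pandas(df).train_test_split(test_size=0.2)

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=256)

ds = ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="w-classifier", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)
trainer.train()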
3.4 Data Quality
In this section, we leverage the work of Croft et al. [4] to assess the quality of MADE-WIC. The paper provides a set of attributes for high-quality software vulnerability datasets.
Accuracy - The degree to which the data has attributes that correctly represent the true value of the intended attribute of a concept or event. To ensure accuracy, we first decided to use the state-of-the-art annotations. Then, we manually reviewed 31 functions annotated as vulnerable by W (one function per heuristic of the W annotation) and their comments from the OSPR portion. Each of the 31 functions was independently verified by three authors. In the end, 6 heuristics have been tuned and 10 removed. The authors then agreed on 21 rules and the corresponding vulnerable functions.
Uniqueness - The degree to which there is no duplication in records. To remove duplicated functions, we used PMD-CPD [18] and a 99% threshold for the Jaccard index of overlapping function code. This process was carried out for each of the datasets, both for individual projects and between different projects. In the end, we removed 127 duplicates.
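The Jaccard criterion used for uniqueness can be written down directly; the following is a minimal sketch of the criterion (token-set Jaccard index with a 0.99 threshold), not the exact PMD-CPD configuration, and the tokenizer is a deliberately crude placeholder.

import re

def tokens(code: str) -> set[str]:
    """Crude tokenizer: identifiers, numbers, and punctuation as a set."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def is_duplicate(a: str, b: str, threshold: float = 0.99) -> bool:
    """Two functions are treated as duplicates above a 99% Jaccard overlap."""
    return jaccard(a, b) >= threshold

print(is_duplicate("int f(int x){return x+1;}", "int f(int x){return x+1;}"))  # True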
Table 2: Functions (including their comments) in MADE-WIC: total (fn), vulnerability annotated as in the original datasets (vf), vulnerability annotated as weaknesses (W), technical debt annotated as (PS) and (MAT), and security annotated as (SecI).

Project        fn        vf      W        PS     MAT     SecI
OSPR, language: C
Chromium       20,028    -       5,205    151    508     514
Linux Ker.     652,726   -       210,220  5,594  8,444   13,182
Mozilla Fir.   15,380    -       4,200    116    436     141
Total          688,134   -       219,625  5,861  9,388   13,837
Devign, language: C
FFmpeg         9,738     4,961   7,321    1,032  1,366   732
Qemu           17,544    7,476   7,785    477    1,378   653
Total          27,282    12,437  15,106   1,509  2,744   1,385
Big-Vul, language: C, C++
Android        8,671     1,267   3,598    53     184     316
Chromium       77,167    3,938   10,434   165    569     334
FFmpeg         1,925     114     1,201    70     99      79
File(1) comm.  294       49      207      12     7       0
ImageMagick    2,489     338     1,271    12     11      91
Kerberos5      832       140     478      15     31      187
Linux Ker.     46,828    1,955   18,298   768    886     2,652
PHP Interp.    2,669     364     1,580    59     88      173
Radare2        1,168     73      722      19     67      19
Tcpdump        778       210     532      64     110     77
Total          144,358   8,448   38,214   1,237  2,042   3,928
Grand Total    859,774   20,885  272,945  8,607  14,174  19,150

Consistency - The degree to which data instances have attributes that are free from contradiction and are coherent with other instances. As there is no oracle for the attributes we considered in this work, the different annotation techniques for one attribute (e.g., vulnerability) report a portion of functions with positive or negative annotation depending on the technique. For instance, the percentage of functions that are consistently annotated by vf and W ranges between 8% and 45%, while for technical debt (PS and MAT annotation), it ranges between 0% and 68% over individual projects.
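These consistency figures can be recomputed from the released CSV files. The sketch below shows one way to measure the per-project overlap of two boolean annotations (here Big-Vul, i.e., the vf annotation of that subset, against W, and PS against MAT); whether this matches the exact metric used above is an assumption, and the file name is illustrative.

import pandas as pd

df = pd.read_csv("BigVul.csv")  # illustrative file name following the schema of Table 1

def positive_agreement(df: pd.DataFrame, a: str, b: str) -> pd.Series:
    """Per project: functions flagged by both annotations over functions flagged by either."""
    both = (df[a] & df[b]).groupby(df["Project-name"]).sum()
    either = (df[a] | df[b]).groupby(df["Project-name"]).sum()
    return (both / either).fillna(0)

print(positive_agreement(df, "Big-Vul", "W"))   # vulnerability annotations
print(positive_agreement(df, "PS", "MAT"))      # technical-debt annotations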
Completeness - The degree to which subject data associated with an entity has values for all expected attributes and related instances. We provided multiple annotations for all attributes of interest (vulnerability, technical debt, security concerns). Some of the OSPR projects miss the Big-Vul or Devign original annotations, and filling them will be a matter of future work.
Currentness - The degree to which data has attributes that are of the right age. The OSPR data was extracted and annotated in 2023, with the exception of the Chromium project, whose functions were extracted from the WeakSATD repository [21]. The Devign and Big-Vul datasets have been extracted as-is from the original sources [6, 23].

4 RELATED WORK AND CONCLUSIONS
To the best of our knowledge, MADE-WIC is the first dataset that provides functions, all relevant comments, and their annotations for technical debt, vulnerability, and security concerns. The three datasets we fused in this study (OSPR, Devign, Big-Vul) have single annotations or different annotation methods. For instance, WeakSATD [21] annotates files of the Chromium project for SATD with PS. In our work, we have reviewed this approach and extracted the Chromium functions and comments from the original source. Not all the datasets in the literature could be fused with our approach. For instance, the dataset of Lin et al. [11], which can be found in the replication package by Nong et al. [17], does not provide the commit-IDs from which the functions were extracted, preventing researchers from retrieving any further data than the ones they published, e.g., leading comments. D2A [22] is a dataset that leverages six open-source projects at function-level granularity. The annotation was done using a tool-based approach, focusing on the function versions before and after a fixing commit. Due to its large size of 3.7 GB, which includes more than 1.2 million instances, integrating this extensive dataset will be addressed in future work.

Acknowledgments. Moritz Mock is partially funded by the National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR - DM 117/2023). The work has been funded by project no. EFRE1039 under the 2023 EFRE/FESR program.

REFERENCES
[1] Jens Bleiholder and Felix Naumann. 2009. Data fusion. ACM Comput. Surv. 41, 1, Article 1 (Jan. 2009), 41 pages.
[2] Michael L. Collard, Michael John Decker, and Jonathan I. Maletic. 2013. srcML: An Infrastructure for the Exploration, Analysis, and Manipulation of Source Code: A Tool Demonstration (ICSME). 516–519.
[3] Roland Croft et al. 2022. An empirical study of developers' discussions about security challenges of different programming languages. Empirical Software Engineering 27 (2022), 1–52.
[4] Roland Croft, M. Ali Babar, and M. Mehdi Kholoosi. 2023. Data Quality for Software Vulnerability Datasets (ICSE). 121–133.
[5] Roland Croft, Yongzheng Xie, and Muhammad Ali Babar. 2022. Data preparation for software vulnerability prediction: A systematic literature review. IEEE Transactions on Software Engineering 49, 3 (2022), 1044–1063.
[6] Jiahao Fan, Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries (MSR '20). Association for Computing Machinery, New York, NY, USA, 508–512.
[7] Zhangyin Feng et al. 2020. CodeBERT: A pre-trained model for programming and natural languages. EMNLP 2020 (2020), 1536–1547.
[8] Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76, 5 (1971), 378.
[9] Zhaoqiang Guo et al. 2021. How Far Have We Progressed in Identifying Self-Admitted Technical Debts? A Comprehensive Empirical Study.
[10] Kim Herzig and Andreas Zeller. 2013. The impact of tangled code changes (MSR). IEEE, 121–130.
[11] Guanjun Lin, Wei Xiao, Jun Zhang, and Yang Xiang. 2020. Deep Learning-Based Vulnerable Function Detection: A Benchmark. In Information and Communications Security, Jianying Zhou, Xiapu Luo, Qingni Shen, and Zhen Xu (Eds.). Springer International Publishing, Cham, 219–232.
[12] Everton da Silva Maldonado, Emad Shihab, and Nikolaos Tsantalis. 2017. Using Natural Language Processing to Automatically Detect Self-Admitted Technical Debt. IEEE Transactions on Software Engineering 43, 11 (2017), 1044–1062.
[13] MITRE. 2006. Common Weakness Enumeration. https://cwe.mitre.org
[14] MITRE. 1999. Common Vulnerabilities and Exposures. https://cve.mitre.org
[15] Moritz Mock, Jorge Melegati, et al. 2024. Replication package. https://doi.org/10.5281/zenodo.12567874
[16] Giang Nguyen-Truong et al. 2022. HERMES: Using Commit-Issue Linking to Detect Vulnerability-Fixing Commits (SANER). 51–62.
[17] Yu Nong et al. 2023. Open Science in Software Engineering: A Study on Deep Learning-Based Vulnerability Detection. IEEE Transactions on Software Engineering 49, 4 (2023), 1983–2005.
[18] PMD. 2024. PMD-CPD. https://pmd.github.io/pmd/pmd_userdocs_cpd.html
[19] Aniket Potdar and Emad Shihab. 2014. An Exploratory Study on Self-Admitted Technical Debt (ICSME). 91–100.
[20] Xiaoxue Ren et al. 2019. Neural Network-Based Detection of Self-Admitted Technical Debt: From Performance to Explainability. ACM Trans. Softw. Eng. Methodol. 28, 3, Article 15 (July 2019), 45 pages.
[21] Barbara Russo, Matteo Camilli, and Moritz Mock. 2022. WeakSATD: Detecting Weak Self-admitted Technical Debt (MSR). 448–453.
[22] Yunhui Zheng et al. 2021. D2A: A dataset built for AI-based vulnerability detection methods using differential analysis (ICSE-SEIP). IEEE, 111–120.
[23] Yaqin Zhou et al. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. Curran Associates.