Malware Detection based on API calls

C. Fellicious et al.
Reykjavik University, Iceland
[email protected]
1 Introduction
Malware attacks are increasing all over the world year over year [17]. Zero-day
exploits, a term used to describe vulnerabilities in software that are unknown to
the software vendor and therefore unpatched, also help malicious attackers hide
their presence on hijacked machines. These exploits pose a significant challenge
for malware detection, as they can be used to launch attacks that are difficult to
detect.
2 Related Work
Aboaoja et al. published a survey on the issues and challenges of malware
detection [1]. The authors outline that detecting evasive malware is still one of
the biggest challenges: although there are several approaches to detecting evasive
malware, such as using multiple execution environments to expose evasive behavior,
the time and resource costs leave each approach with its own weaknesses. Shukla et
al. developed a method that uses an RNN to detect so-called stealthy malware [14],
which the authors define as "malware created by embedding the malware into a
benign application through advanced obfuscation strategies to thwart the
detection." The authors "translate the application binaries into images, further
convert it into sequences, and extract local features for stealthy malware
detection." Feng et al. proposed DawnGNN, a novel documentation-augmented Windows
malware detection framework based on Graph Attention Networks (GAT) [8]. The
method converts API sequences into API graphs to extract contextual information,
encodes the functionality descriptions using BERT, and finally uses graph
attention for classification. Li et al. proposed a method for dynamic malware
detection based on API calls [10]. The authors use intrinsic features of the API
sequence, which they claim allows the models to capture and combine more
meaningful features, and they represent the semantic information of each API call
by its category, action, and operation object. Classification is done with a
bidirectional LSTM module, and their results outperform the baselines. Cui et al.
proposed a graph-based approach to detect malware from API call sequences [7]. The
method creates two graphs, a Temporal Process Graph (TPG) and a Temporal API Graph
(TAG), to model inter- and intra-process behavior. A heuristic random walk
algorithm then generates paths that capture the malware behavior, and the authors
generate embeddings from these paths using the Doc2Vec model. Chen et al. proposed
CruParamer, a parameter-augmented approach for the Microsoft Windows platform [6].
The method employs rule-based and clustering-based classification to compute the
sensitivity of an API-call parameter to malicious behavior. Classification is done
by concatenating the API embedding with the sensitivity embedding of the labeled
APIs so that their relationship is captured in the input vector. The authors then
train a binary classifier to identify malware and report that it outperforms naive
models. Almousa et al. proposed a method to identify ransomware attacks based on
API calls [2]. The authors first studied the lifecycle of ransomware on the
Microsoft Windows platform and then extracted malicious code patterns, using data
from publicly available repositories and running the malicious code in a sandbox.
Machine learning models built on this analysis yielded a high detection rate.
There are also multiple public malware datasets based on API calls. Catak et
al. published a malware dataset that contains different malware types [4]. The
dataset covers eight malware types, namely Trojan, Backdoor, Downloader, Worms,
Spyware, Adware, Dropper, and Virus, across 7107 samples. The authors created the
dataset using the Cuckoo Sandbox and made it available on GitHub. Zhang published
a dataset of 10,654 samples with sample labels [20]. The author divides the
dataset into normal, ransomware, miner, DDoS Trojan, worm, infective virus,
backdoor, and Trojan; the data comes from the Alibaba Security Algorithm
Challenge. Trinh created a much larger dataset of 1.55 million samples, of which
800,000 are malware and 750,000 are "goodware" [19]. Another dataset, from
Oliveira, contains "42,797 malware API call sequences and 1,079 goodware API call
sequences" [13]. It comprises the "first 100 non-repeated consecutive API calls
associated with the parent process, extracted from the 'calls' elements of Cuckoo
Sandbox reports". A few other open datasets from challenges exist, such as the
Aliyun Malware Detection Dataset [18], but most of them are either not publicly
available or have too narrow a scope. More recent work is by Maniriho et al., who
created their own dataset of 1285 malicious and 1285 benign samples [11]. The
dataset is publicly available and hosted on GitHub.
3 Method
We structure this section into two parts. The first describes the dataset, its
creation, and corresponding statistics. The second explains the order-invariant
method we use to generate the results.
3.1 Dataset
A popular method to detect malware is tracing Application Programming
Interface (API) calls. An API is a set of rules or protocols that enables software
applications to communicate with each other to exchange data, features, and
functionality (https://www.ibm.com/topics/api). We created this dataset to address
the gaps in the existing landscape of malware API call datasets, with the
following properties:
– up to date with current malware,
– large and varied enough to cover most modern malware,
– labels not restricted to a single category: malware might not always fall into
one specific category alone, and from a cybersecurity perspective it makes more
sense to group samples by malware family,
– collected from real-world machines: we worked closely with G DATA CyberDefense
AG for data collection, and working directly with a cybersecurity firm gives us
the advantage of knowing the malware is from the real world.
In a Microsoft Windows environment, user processes interact with the Oper-
ating System using dynamically linked libraries (DLL). The Windows ntdll.dll li-
brary is one such library. "The NTDLL runtime consists of hundreds of func-
tions that allow native applications to perform file I/O, interact with device
drivers, and perform interprocess communications" [12]. Any malware, or for that
matter any user process, needs to communicate with the NTDLL library, which makes
it the perfect library for hooking our API call logger. The library contains
hundreds of functions covering different functionalities, such as semaphores,
threads, and creating events or objects. It belongs to the Native API and can call
functions in user or kernel mode [5]. We therefore decided to trace the function
calls to the ntdll.dll library. Tracing all of its functions is cumbersome, as
documentation only exists for a few of them, so we selected a subset of 59 NTDLL
functions that have proven valuable [16]. We obtained this set of function calls
in cooperation with the cybersecurity experts of G DATA CyberDefense AG.
We wanted to keep up with the latest malware trends, so we collected samples
for approximately six months, between 01.05.2023 and 01.11.2023. During this
period, we undertook the monumental task of observing approximately 1,000,000
malicious samples. Malware occurs in waves in the real world: if a specific
malware type is successful, it spreads like wildfire across different networks.
This, in turn, caused spikes in a few classes of our dataset, and we removed such
samples to avoid biasing a classifier towards a few classes. We retained
approximately 330k samples and grouped them into different malware families.
Labeling this data proved to be a challenge as well. The most common form of
labeling malware data is grouping it into predefined classes such as trojan,
virus, backdoor, or rootkit, but newer malware can have patterns overlapping
multiple labels, and a single label might not capture the complete behavior of the
malware. Therefore, rather than grouping samples into a single category such as
virus, backdoor, or trojan, we group them into distinct malware families based on
source code analysis by G DATA CyberDefense AG. This means that malware with the
same label in our dataset exhibits the same properties, which should allow for
better grouping and analysis by cybersecurity researchers.
A malware detection dataset also requires data from benign software. Acquiring
benign software is not a problem at all, as there are millions of verified benign
programs from different organizations all over the world. Our problem was that
most benign software requires user interaction to install or has a Graphical User
Interface (GUI). The drawback of having such benign software in the dataset is
that delineating malware from benign software would become trivially easy, as most
malware has no GUI. We needed benign software that has no GUI and requires no user
interaction to execute; this criterion alone makes it similar to some malware
genres, which also require no input from the user [9,3]. Our cybersecurity partner
firm maintains a comprehensive whitelist of benign software, which includes a
variety of trusted applications and services regularly used in enterprise
environments. This whitelist was instrumental in ensuring the integrity and
reliability of our benign sample set. In addition to the whitelist, we included
Microsoft service executables, which are widely recognized as baseline components
of the Windows operating system. We aimed to cover a broad spectrum of typical,
non-malicious software.
7 https://github.com/prskr/inetmock
eters into a JSON file named SHA. We show the sample trace of a single API call
in Figure 1.

{
  "level": "info",
  "ts": "2023-12-04T09:23:36Z",
  "msg": "Monitored function called",
  "vmi_ts": "2023-12-04T09:19:09Z",
  "vmi_logger": "ApiTracing_FunctionHook",
  "vmi_Parameterlist":
  [
    { "SystemInformationClass": 192 },
    { "SystemInformation": 1373520 },
    { "SystemInformationLength": 32 },
    { "ReturnLength": 0 }
  ],
  "vmi_FunctionName": "NtQuerySystemInformation",
  "vmi_ModuleName": "ntdll.dll",
  "vmi_ProcessDtb": "65aca001",
  "vmi_ProcessTeb": "2e6000"
}

Fig. 1. Sample trace of a single call to NtQuerySystemInformation in ntdll.dll.
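To illustrate how such traces can be consumed, the following minimal Python sketch
parses trace files into ordered API call sequences. It assumes, as a
simplification, one trace file per sample named by its SHA hash and one JSON
object per line per monitored call; only the vmi_FunctionName field shown in
Figure 1 is used. The file layout and helper names are our assumptions, not the
authors' released tooling.

import json
from pathlib import Path

def load_call_sequence(trace_path: Path) -> list[str]:
    """Parse one trace file into the ordered list of monitored API function names.
    Assumes one JSON object per line (see the sample record in Fig. 1)."""
    calls = []
    with trace_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            name = record.get("vmi_FunctionName")  # field shown in Fig. 1
            if name is not None:
                calls.append(name)
    return calls

# Hypothetical layout: traces/ holds one file per sample, named by its SHA hash.
sequences = {path.name: load_call_sequence(path) for path in Path("traces").iterdir()}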
We created an online repository containing all information about the dataset
and our code, along with the names of the traced functions
(https://github.com/cfellicious/api-based-malware-detection). Overall, the
uncompressed size of the dataset is approximately 572 GB. We have already
published the dataset on Zenodo (https://zenodo.org/records/11079764), and it is
publicly accessible.
$V_i = |\mathrm{Call}_i|$ (2)
When we consider the Bigram model, we also look at the immediately preceding
API call. We do this using a sliding window approach over the entire API call
sequence. In theory, the length of the feature vector would be the total number of
combinations of two API calls, which is $|V|^2$ ($59^2$). As in the Unigram model,
we create a feature for every combination of API calls: each feature corresponds
to two consecutive calls concatenated, like $\mathrm{Call}_1\mathrm{Call}_2$, and
the total count of that combination is the value of the feature at index $i$:

$V_i = |\theta(\mathrm{Call}_{i-1}, \mathrm{Call}_i)|$ (3)

where $\theta$ is a mapping from two consecutive API calls to an index. For the
third part, we consider the two previous API calls in the sequence. We follow the
same procedure as for the Unigram model; the only differences are the length of
the feature vector and the number of API calls considered.
For the Trigram model, we consider three consecutive calls, as given in
Equation 4. In this case, the size of the feature vector expands to $|V|^3$
($= 59^3$), which is quite large compared to the feature vectors of the Unigram
and Bigram models.

$V_i = |\theta(\mathrm{Call}_{i-2}, \mathrm{Call}_{i-1}, \mathrm{Call}_i)|$ (4)
where $\theta$ is a mapping from three consecutive API calls to an index. The
mapping from unique function sequences to indices is provided as a JSON file in
our online code repository. The final model, which we call the Combined model,
concatenates the feature vectors of the Unigram, Bigram, and Trigram models. We do
this in the hope that the Combined model exploits the strengths of the Unigram,
Bigram, and Trigram models independently and obtains discriminating information
from any of its inputs. The feature vector length of the Unigram model is 59,
corresponding to the number of traced ntdll.dll functions, which is manageable,
but the Bigram and Trigram models have theoretical lengths of 3481 and 205,379.
Although the Bigram feature vector is manageable, it is still quite large, and a
Trigram-based feature vector for more than 330k samples is only feasible on
machines with substantial amounts of memory. Moreover, such a feature vector would
be sparse, since most values would be zero. Therefore, we identify all the unique
bigram and trigram function call combinations and create feature vectors using
only those present in the dataset. There are 2540 unique bigrams and 5483 unique
trigram combinations in the dataset, so for practicality and to save memory we
limit the Bigram and Trigram feature vectors to lengths of 2540 and 5483,
respectively. We then train a random forest on these feature vectors and predict
whether a given sample from the test set is malware or benign. The dataset is
unbalanced, containing many more malicious samples than benign ones.
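As a concrete sketch of this feature construction (ours, not the authors'
published implementation), the snippet below builds the mapping $\theta$ from the
n-grams observed in the data, produces the concatenated Unigram/Bigram/Trigram
count vectors of the Combined model, and trains a random forest. The variables
train_sequences and train_labels are hypothetical placeholders for the parsed call
sequences and their malware/benign labels.

from collections import Counter
from itertools import chain

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ngrams(seq, n):
    """Return the consecutive n-grams of an API call sequence (sliding window)."""
    return list(zip(*(seq[i:] for i in range(n))))

def build_vocab(sequences, n):
    """Build theta: a mapping from each n-gram observed in the corpus to an index."""
    observed = sorted(set(chain.from_iterable(ngrams(s, n) for s in sequences)))
    return {gram: i for i, gram in enumerate(observed)}

def featurize(seq, vocabs):
    """Concatenate uni-, bi-, and trigram count vectors (the Combined model)."""
    parts = []
    for n, vocab in sorted(vocabs.items()):
        vec = np.zeros(len(vocab))
        for gram, count in Counter(ngrams(seq, n)).items():
            idx = vocab.get(gram)  # n-grams unseen at training time are dropped
            if idx is not None:
                vec[idx] = count
        parts.append(vec)
    return np.concatenate(parts)

# train_sequences: list of API-call-name lists; train_labels: 0 = malware, 1 = benign.
vocabs = {n: build_vocab(train_sequences, n) for n in (1, 2, 3)}
X_train = np.array([featurize(s, vocabs) for s in train_sequences])
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, train_labels)

Restricting $\theta$ to observed n-grams is what keeps the Bigram and Trigram
vectors at 2540 and 5483 dimensions instead of their theoretical sizes.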
4 Results
per sample, we can confidently identify the sample's malicious nature, demon-
strating the power of our analysis without the need for temporal information.
The fifty-nine unique API calls to the ntdll.dll library correspond to the
fifty-nine functions we hooked. Using these fifty-nine unique function calls, we
find 2540 unique bigrams (two consecutive API calls) and 5483 unique trigrams
(three consecutive API calls) with a sliding window approach. An interesting
aspect of Figure 2 is the behavior of the Trigram model (a window of three
consecutive API calls): its F1-Score drops off after 200 API calls. This is due to
the large number of unique sequences, which spreads the counts out along the
feature vector; because most trigram sequences occur only once, the feature vector
becomes a smear of mostly individual unique API call sequences.
Fig. 2. F1-Score for all our models at different max API call counts. The X-axis is on
a logarithmic scale. The results are the average of four different runs.
In our case, we set the benign class to "1" and the malware class to "0". We do
this because the data is highly imbalanced in favor of the malware class. In a
real-world scenario, we do not want malware samples identified as benign; we
therefore need a very precise model, so that no malware is classified as benign.
Although a higher recall is good, in our case it should not come at the expense of
precision. If our model has false negatives, some benign software is classified as
malware and might warrant a closer look; lower precision, however, could have
drastic consequences, since malware would pass as benign. A precision-recall curve
helps us understand the model's confidence in its predictions on the test data.
From Figure 3, we see that the Unigram model (single function calls, irrespective
of order) performs very well, as do the Bigram and Combined models. The exception
is the Trigram model, whose precision drops drastically. From this, a Unigram
model should be sufficient for distinguishing malware from benign software. All
our results and plots are available in our GitHub repository
(https://github.com/cfellicious/api-based-malware-detection).
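A precision-recall curve with benign as the positive class can be produced as in
the sketch below, which assumes the classifier clf from the earlier sketch and
hypothetical held-out arrays X_test and y_test.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Benign is the positive class ("1"), so precision here measures how often a
# sample predicted benign really is benign; this is the critical quantity above.
benign_scores = clf.predict_proba(X_test)[:, 1]  # P(benign) per sample
precision, recall, _ = precision_recall_curve(y_test, benign_scores)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall curve (benign = positive class)")
plt.show()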
Fig. 3. Precision-Recall curves for the Unigram, Bigram, Trigram, and Combined
models. The Trigram model's confidence drops much more quickly than that of the
other models.
From the different metrics and plots, we see that the Bigram and Unigram
models provide the best overall performance. However, the Unigram model might be
preferred simply because of its lower memory requirements and faster execution.
5 Conclusion
We created a substantial corpus of API calls from recent malware and benign
software, with a specific focus on the ntdll Windows library, underscoring its
significance in our research. Our dataset, the largest of its kind available pub-
licly, is a unique and invaluable resource for the research community in malware
detection. Our experiments demonstrate that even the basic metric of function
call counts can significantly distinguish between malicious and benign software.
The created models show that by using good feature engineering techniques, we
can detect malware precisely with negligible performance overhead. Our method
detects malware reliably from a small number of API calls, demonstrating its
efficiency and practicality. Furthermore, we found that for a malware
detection system to be effective, it must analyze at least two hundred and fifty
API calls to the ntdll.dll library. This threshold ensures a reliable level of de-
tection accuracy, reinforcing the importance of comprehensive data collection in
developing robust malware detection systems. Our findings also highlight the
potential of simple yet effective features in developing efficient and scalable mal-
ware detection solutions, paving the way for future research and advancements
in cybersecurity.
References
1. Aboaoja, F.A., Zainal, A., Ghaleb, F.A., Al-Rimy, B.A.S., Eisa, T.A.E., Elnour,
A.A.H.: Malware detection issues, challenges, and future directions: A survey. Ap-
plied Sciences 12(17), 8482 (2022)
2. Almousa, M., Basavaraju, S., Anwar, M.: Api-based ransomware detection using
machine learning-based threat detection models. In: 2021 18th International Con-
ference on Privacy, Security and Trust (PST). pp. 1–7. IEEE (2021)
3. Bilot, T., El Madhoun, N., Al Agha, K., Zouaoui, A.: A survey on malware detec-
tion with graph representation learning. ACM Computing Surveys 56(11), 1–36
(2024)
4. Catak, F.O., Ahmed, J., Sahinbas, K., Khand, Z.H.: Data augmentation based
malware detection using convolutional neural networks. PeerJ Computer Science
7, e346 (Jan 2021). https://doi.org/10.7717/peerj-cs.346
5. Chappell, G.: Native api functions (2024), https://www.geoffchappell.com/
studies/windows/win32/ntdll/api/native.htm
6. Chen, X., Hao, Z., Li, L., Cui, L., Zhu, Y., Ding, Z., Liu, Y.: Cruparamer: Learning
on parameter-augmented api sequences for malware detection. IEEE Transactions
on Information Forensics and Security 17, 788–803 (2022)
7. Cui, L., Cui, J., Ji, Y., Hao, Z., Li, L., Ding, Z.: Api2vec: Learning representations
of api sequences for malware detection. In: Proceedings of the 32nd ACM SIGSOFT
International Symposium on Software Testing and Analysis. pp. 261–273 (2023)
8. Feng, P., Gai, L., Yang, L., Wang, Q., Li, T., Xi, N., Ma, J.: Dawngnn: Doc-
umentation augmented windows malware detection using graph neural network.
Computers & Security p. 103788 (2024)
9. Gopinath, M., Sethuraman, S.C.: A comprehensive survey on deep learning based
malware detection techniques. Computer Science Review 47, 100529 (2023)
10. Li, C., Lv, Q., Li, N., Wang, Y., Sun, D., Qiao, Y.: A novel deep framework for
dynamic malware detection based on api sequence intrinsic features. Computers &
Security 116, 102686 (2022)
11. Maniriho, P., Mahmood, A.N., Chowdhury, M.J.M.: Api-maldetect: Automated
malware detection framework for windows based on api calls and deep learning
techniques. Journal of Network and Computer Applications 218, 103704 (2023)
12. Microsoft: Inside native applications - sysinternals (2024), https://learn.
microsoft.com/en-us/sysinternals/resources/inside-native-applications
13. de Oliveira, A.S., Sassi, R.J.: Behavioral malware detection using deep graph con-
volutional neural networks. Authorea Preprints (2023)
14. Shukla, S., Kolhe, G., PD, S.M., Rafatirad, S.: Stealthy malware detection using
rnn-based automated localized feature extraction and classifier. In: 2019 IEEE 31st
international conference on tools with artificial intelligence (ICTAI). pp. 590–597.
IEEE (2019)