Malware Detection based on API calls

Christofer Fellicious¹[0000−0001−7487−7110], Manuel Bischof², Kevin Mayer², Dorian Eikenberg², Stefan Hausotte², Hans P. Reiser¹,³[0000−0002−2815−5747], and Michael Granitzer¹[0000−0003−3566−5507]

¹ University of Passau, Germany, [email protected]
² GData Cyberdefense, Germany, [email protected]
³ Reykjavik University, Iceland, [email protected]

arXiv:2502.12863v1 [cs.CR] 18 Feb 2025

Abstract. Malware attacks pose a significant threat in today's interconnected digital landscape, causing billions of dollars in damages. Detecting and identifying malware families as early as possible provides an edge in protecting against such malware. We explore a lightweight, order-invariant approach to detecting and mitigating malware threats: analyzing API calls without regard to their sequence. We publish a public dataset of over three hundred thousand samples and their function call parameters for this task, annotated with labels indicating benign or malicious activity. The complete dataset is above 550GB uncompressed in size. We leverage machine learning algorithms, such as random forests, and conduct behavioral analysis by examining patterns and anomalies in API call sequences. By investigating how the function calls occur regardless of their order, we can identify discriminating features that help us identify malware early on. The models we developed are not only effective but also efficient: they are lightweight and can run on any machine with minimal performance overhead, while still achieving an F1-Score of over 85%. We also empirically show that we only need a subset of the function call sequence, specifically calls to the ntdll.dll library, to identify malware. Our research demonstrates the efficacy of this approach through empirical evaluations, underscoring its accuracy and scalability. The code is open source and available on GitHub⁴ along with the dataset⁵.

Keywords: machine learning · dataset · API based malware detection.

1 Introduction
Malware attacks are increasing worldwide year over year [17]. Zero-day
exploits, a term used to describe vulnerabilities in software that are unknown to
the software vendor and therefore unpatched, also help malicious attackers hide
their presence on hijacked machines. These exploits pose a significant challenge
for malware detection as they can be used to launch attacks that are difficult to
4 https://github.com/cfellicious/api-based-malware-detection
5 https://zenodo.org/records/11079764
detect and defend against. Polymorphic malware presents a significant challenge for traditional code-based checks: the code itself is encrypted, and garbage values are inserted into the payload files, making the malware difficult to identify and analyze.
Machine learning shows promise in identifying malware but requires large
amounts of data to generalize well. Although many cybersecurity companies use machine-learning-based methods, most of their datasets are proprietary. Existing datasets are often narrow in scope or insufficiently comprehensive,
posing challenges for researchers and developers aiming to create and refine
machine learning models for malware detection. While some publicly available
datasets exist, they tend to be outdated, small in scale, or lacking in diversity,
limiting their utility in addressing modern malware’s sophisticated and rapidly
evolving nature. This highlights a critical need for the development and dissem-
ination of large-scale, diverse datasets that encompass a wide range of benign
and malicious software behaviors. We address this gap by creating a dataset
from data collected with the help of G DATA CyberDefense AG and making it
publicly available.
Another aspect is the practical side of malware detection. System software
running on machines should have a manageable overhead and not impede the
performance of essential business software. Therefore, having very complex mod-
els for real-time analysis that take up valuable resources is not feasible, as the
memory and performance requirements will hinder other software. We could
have more complex offline models for forensic analysis. However, the best option
on a running machine would be a lightweight model that uses simple feature
engineering but offers a very high degree of accuracy.
Our primary contribution is the creation of the largest publicly available dataset of API calls, with traces from over 300,000 recent malware samples and 10,000 benign samples. The uncompressed size of the dataset exceeds 550GB and is available
on Zenodo. The second contribution is lightweight models that could be used
in real-time malware detection based on their API calls. We empirically show
that we could detect malware reliably with as few as 250 API calls. Our findings
show that the mere frequency of API calls is a powerful indicator of malicious
intent, enabling the detection of malware with high certainty. Moreover, our
research underscores the importance of comprehensive data collection in devel-
oping robust malware detection systems, providing a valuable resource for the
cybersecurity community. For our third contribution, we conduct a comprehen-
sive study to determine the number of API calls required for the best detection
performance at different maximum API call counts.
We introduce the premise and research gap in section 1. We discuss the current landscape of malware detection methods in section 2. Our dataset creation and malware identification method is presented in section 3. We present the results of our experiments in section 4 and our conclusions in section 5.

2 Related Work
Aboaoja et al. published a survey on the various issues and challenges of malware detection [1]. The authors outline that detecting evasive malware is still one of the biggest challenges. Although there are several approaches to detecting evasive malware, such as using multiple execution environments to identify evasive behavior, their time and resource complexity leaves each approach with its own weaknesses. Shukla et al. developed a method that uses RNNs to detect so-
called stealthy malware [14]. The authors define stealthy malware as "malware
created by embedding the malware into a benign application through advanced
obfuscation strategies to thwart the detection." The authors "translate the ap-
plication binaries into images, further convert it into sequences, and extract
local features for stealthy malware detection." Feng et al. proposed the method
DawnGNN using Graph Attention Networks (GAT) [8]. The authors proposed a
novel documentation-augmented Windows Malware Detection Framework. The
method works by converting the API sequences into API graphs to extract con-
textual information. The authors encode the functionality descriptions using
BERT and finally use Graph attention for classification. Li et al. proposed a
method to detect dynamic malware based on API calls [10]. The authors used
intrinsic features of the API sequence. The authors claim that this allows the
models to capture and combine more meaningful features. The authors then
use the category, action, and operation object of the API to represent the se-
mantic information of each API call. The authors do the classification using
a Bidirectional LSTM module and their results outperform the baselines. Cui
et al. proposed a graph-based approach to detect malware from API-based call
sequences [7]. The proposed method works by creating two graphs, a Temporal
Process Graph (TPG) and a Temporal API Graph (TAG) to model intra-process
behavior. A heuristic random walk algorithm then generates several paths that
can capture the malware behavior. The authors generate the embeddings using
the paths pre-trained by the Doc2Vec model. Chen et al. proposed a parameter-
augmented approach for the Microsoft Windows platform called CruParamer [6].
The method employs rule-based and clustering-based classification to compute
the sensitivity of an API-call parameter to malicious behavior. The classifica-
tion is done by concatenating the API embedding to the sensitive embedding
of the labeled APIs so that their relationship is captured in the input vector.
The authors then train a binary classifier to identify malware, and according
to the authors, their model outperforms naive models. Almousa et al. proposed
a method to identify ransomware attacks based on API calls [2]. The authors
initially studied the lifecycle of ransomware on the Microsoft Windows platform.
The next step was to extract malicious code patterns. The authors used data
from publicly available repositories and sampled the malicious code in a sand-
box. Machine learning models were built based on this analysis and yielded a
high detection rate.
There are also multiple public malware datasets based on API calls. Catak
et al. published a malware dataset that contained different malware types [4].
The dataset contains eight malware types, namely Trojan, Backdoor, Downloader, Worm, Spyware, Adware, Dropper, and Virus, across 7107 samples. The
authors created the dataset using the Cuckoo Sandbox, available on GitHub.
Zhang published a dataset of 10,654 samples with sample labels [20]. The au-
thor divides the dataset into normal, ransomware, miner, DDoS Trojan, worm,
infective virus, backdoor, and Trojan. This dataset is from the Alibaba Security
Algorithm Challenge. Trinh created a much larger dataset of 1.55 million sam-
ples, of which 800,000 malware and 750,000 "goodware" samples are present [19].
Another dataset is from Oliveira with "42,797 malware API call sequences and
1,079 goodware API call sequences" [13]. The dataset comprises API call se-
quences of the "first 100 non-repeated consecutive API calls associated with the
parent process, extracted from the ’calls’ elements of Cuckoo Sandbox reports".
A few other open datasets from challenges, such as the Aliyun Malware Detec-
tion Dataset [18] exist, but most of the datasets are either unavailable publicly
or have a too narrow scope. More recent work is by Maniriho et al., where the
authors created their dataset [11]. The dataset consists of 1285 malicious and
1285 benign samples. The dataset is publicly available and hosted on GitHub.

3 Method
We structure this section into two parts. The first describes the dataset, its
creation, and corresponding statistics. The second explains the order-invariant
method we use to generate the results.

3.1 Dataset
A popular method to detect malware is tracing Application Programming Interface (API) calls. An API is a set of rules or protocols that enables software applications to communicate with each other to exchange data, features, and functionality⁶. We created this dataset to address gaps in the existing landscape of malware API call datasets; it has the following properties:
– up to date with the current malware
– large and varied enough to cover most modern malware
– does not restrict the labels to a single category. Malware might not always
fall into a specific category alone. From a cybersecurity perspective, it makes
more sense to group samples by the malware family.
– a dataset collected from real-world machines. We worked closely with G
DATA CyberDefense AG for data collection. Working directly with a cyber-
security firm gives us the advantage of knowing the malware is from the real
world.
In a Microsoft Windows environment, user processes interact with the Oper-
ating System using dynamically linked libraries (DLL). The Windows ntdll.dll li-
brary is one such library. "The NTDLL runtime consists of hundreds of func-
tions that allow native applications to perform file I/O, interact with device
6 https://www.ibm.com/topics/api

drivers, and perform interprocess communications" [12]. Any malware or, for
that matter, the user process will need to communicate with the NTDLL li-
brary, which makes it the perfect library for hooking our API call logger. This
library contains hundreds of functions related to different functionalities, such
as semaphores, threads, creating events or objects, and so forth. The library
belongs to the Native API and can call functions in user or kernel mode [5].
Therefore, we decided to trace the function calls to the ntdll.dll library. Tracing all the functions in ntdll.dll is cumbersome, as documentation exists for only a few of them. Therefore, we selected a subset of 59 NTDLL functions that have proven valuable [16]. We obtained this set of function calls in cooperation with the cybersecurity experts at G DATA CyberDefense AG.
We wanted to keep up with the latest malware trends, so we collected samples for approximately six months, between 01.05.2023 and 01.11.2023. During this period, we observed approximately 1,000,000 malicious samples. Malware occurs in waves in the real world: if a specific malware type is successful, it spreads like wildfire across different networks. This, in turn, caused a spike in a few classes in our dataset, and we removed such samples to avoid biasing any classifier towards a few classes. We retained approximately 330k samples, which we grouped into different malware
families. Labeling this data proved to be a challenge as well. The most common form of labeling malware data is grouping it into predefined classes such as trojan, virus, backdoor, or rootkit. But newer malware can have overlapping patterns across multiple labels, and a single label might not capture the complete behavior of the malware. Therefore, rather than grouping
them into a single category, such as a virus, backdoor, or trojan, we group them
into distinct malware families based on source code analysis from G DATA Cy-
berDefense AG. This means that malware belonging to the same label in our
dataset will exhibit the same properties and this should allow for better grouping
and analysis by cybersecurity researchers.
For a malware detection dataset, we also require data from benign software. Acquiring benign software is not a problem, as there are millions of verified benign programs from organizations all over the world. Our problem was that most benign software required user interaction to install or had a Graphical User Interface (GUI). The drawback of having such benign software in the dataset is that delineating between malware and benign software would be too easy, as most malware does not have a GUI. We needed benign software that had no GUI and required no user interaction to execute. This criterion alone makes it similar to some types of malware, as some malware genres also require no input from the user [9,3]. Our cybersecurity partner firm
maintains a comprehensive whitelist of benign software, which includes a variety
of trusted applications and services regularly used in enterprise environments.
This whitelist was instrumental in ensuring the integrity and reliability of our
benign sample set. In addition to this whitelist, we included Microsoft service
executables, which are widely recognized as baseline components of the Windows
operating system. We aimed to cover a broad spectrum of typical, non-malicious
software behavior by incorporating these executables. Including a diverse set of benign software, sourced from our partner's whitelist and Microsoft services, significantly enhances the robustness of our dataset. This diversity is a key element for training machine learning models that need to
accurately differentiate between benign and malicious API call patterns. Our ap-
proach ensures that the benign samples are representative of real-world software
environments, thereby improving the reliability and effectiveness of the malware
detection system we developed.
We uniquely identify each malicious and benign sample in our dataset us-
ing a SHA value, which serves as a digital fingerprint for each executable. This
SHA value is crucial for distinguishing between different versions and variants
of software, particularly for malware samples that belong to the same family
but exhibit variations in their code. These differences can arise from factors
such as version updates, code rewrites, or intentional modifications designed to
evade detection mechanisms that rely on SHA analysis. By assigning a unique
SHA value to each executable, we can accurately track and manage the diversity
of samples within our dataset. This differentiation is significant for mutating
malware, which frequently changes its code to avoid detection. The ability to
identify and catalog these variations ensures that our dataset reflects the dy-
namic nature of real-world malware, enhancing the robustness of our detection
models. We utilize a controlled virtual environment to execute and monitor each
malicious sample. This approach allows us to observe the malware’s behavior
in isolation, preventing any potential damage to actual systems. During execu-
tion, we meticulously monitor and log all API calls made by the malware to
the ntdll.dll library. The monitoring and logging tools used in this process are proprietary
to our cybersecurity partner, ensuring precise and secure data collection. These
proprietary tools are specifically designed to capture detailed API call traces,
providing a comprehensive view of the malware’s interaction with the operating
system. We then use the resulting data to construct a detailed profile of each
sample’s behavior, which is integral to developing our machine learning-based
malware detection system. By leveraging unique SHA values and advanced mon-
itoring techniques, we ensure that our dataset is extensive and precise, forming
a solid foundation for accurate and reliable malware detection.
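As a minimal sketch of this fingerprinting idea: the paper does not name the exact SHA variant, so SHA-256 is assumed here, and the sample bytes are purely hypothetical. Hashing the raw bytes of an executable yields a stable identifier that changes completely under even a one-byte mutation.

```python
import hashlib

def sample_id(data: bytes) -> str:
    """Hex digest identifying an executable by its exact bytes.

    SHA-256 is an assumption; the paper only says 'a SHA value'.
    """
    return hashlib.sha256(data).hexdigest()

# Two hypothetical variants of the same family, differing by one byte,
# receive entirely different fingerprints.
v1 = sample_id(b"MZ\x90\x00 payload v1")
v2 = sample_id(b"MZ\x90\x00 payload v2")
print(v1 != v2)  # True
print(len(v1))   # 64 hex characters
```

This is why the SHA value can distinguish versions and variants within a family: any code rewrite, version bump, or evasion-driven mutation produces a new digest.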
We simulate an internet connection using an open-source library⁷. Some
malware samples, after unpacking themselves, request the download of an exe-
cutable. We simulate the executable download by sending a predefined harmless
executable whose functionalities are inert and whose behavior is predictable. We
also monitor any direct child, determined by InheritedFromUniqueProces-
sId, from the EPROCESS struct [15]. Therefore, if the malicious sample forks,
we will still monitor all the child processes and log their API calls. Each such
sample, both malware and benign, is traced over a runtime of 360 seconds. The
execution environment is a Microsoft Windows 10 21H2 virtual machine with
4GB RAM and one vCPU. The logger then dumps all the calls and call parameters into a JSON file named after the sample's SHA value. We show the sample trace of a single API call in Figure 1.

7 https://github.com/prskr/inetmock

{
  "level": "info",
  "ts": "2023-12-04T09:23:36Z",
  "msg": "Monitored function called",
  "vmi_ts": "2023-12-04T09:19:09Z",
  "vmi_logger": "ApiTracing_FunctionHook",
  "vmi_Parameterlist":
  [
    { "SystemInformationClass": 192 },
    { "SystemInformation": 1373520 },
    { "SystemInformationLength": 32 },
    { "ReturnLength": 0 }
  ],
  "vmi_FunctionName": "NtQuerySystemInformation",
  "vmi_ModuleName": "ntdll.dll",
  "vmi_ProcessDtb": "65aca001",
  "vmi_ProcessTeb": "2e6000"
}

Fig. 1. Sample log of a single API call
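A trace in the format of Figure 1 can be reduced to per-function counts with a few lines of Python. This is an illustrative sketch, not the proprietary tooling: it assumes one JSON object per line and relies on the `vmi_FunctionName` field shown in the sample log, with all other fields trimmed for brevity.

```python
import json
from collections import Counter

def count_api_calls(log_lines):
    """Tally monitored function calls from JSON log entries.

    Each entry is assumed to be a JSON object carrying a
    'vmi_FunctionName' field, as in the log shown in Figure 1.
    """
    counts = Counter()
    for line in log_lines:
        entry = json.loads(line)
        name = entry.get("vmi_FunctionName")
        if name is not None:
            counts[name] += 1
    return counts

# A toy three-entry trace in the same shape as Figure 1 (fields trimmed).
trace = [
    '{"vmi_FunctionName": "NtQuerySystemInformation", "vmi_ModuleName": "ntdll.dll"}',
    '{"vmi_FunctionName": "NtQuerySystemInformation", "vmi_ModuleName": "ntdll.dll"}',
    '{"vmi_FunctionName": "NtCreateFile", "vmi_ModuleName": "ntdll.dll"}',
]
print(count_api_calls(trace))
# Counter({'NtQuerySystemInformation': 2, 'NtCreateFile': 1})
```

Counts of this form are exactly the raw material the order-invariant features in the next section are built from.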
We created an online repository containing all information about the dataset and our code, along with the names of the traced functions⁸. Overall, the uncompressed size of the dataset is approximately 572GB. We have published the dataset on Zenodo⁹, and it is currently publicly accessible.

3.2 Order-Invariant Method

We aim to develop a malware detection method independent of the temporal constraints and ordering of API calls. This approach ensures that the detection
system remains effective even when the sequence of API calls is altered, which
is a common evasion technique used by malware. To achieve this, it is crucial to
thoroughly investigate the impact on performance as we progressively increase
the sequence length of successive API calls under consideration.
Our proposed solution involves mapping each function call directly to a fea-
ture in the feature vector, with the value in each position representing the number
of times the sample invoked that particular function. This method allows us to
create a robust feature representation that is not influenced by the order of API
calls, focusing instead on the frequency of each function’s invocation.
We structure our experiments into four distinct parts to evaluate this ap-
proach comprehensively and ensure the validity of our findings. In the first part,
8 https://github.com/cfellicious/api-based-malware-detection
9 https://zenodo.org/records/11079764

we examine each API call individually, disregarding the context provided by previous and subsequent API calls. We refer to this as the Unigram model, a
concept borrowed from natural language processing (NLP). In NLP, a Unigram
model analyzes text by considering each word independently, without account-
ing for the sequence in which words appear. Similarly, in our Unigram model
for API calls, we treat each function call as an independent event, counting its
occurrences without considering its position in the sequence.
This initial experiment establishes a baseline understanding of how well in-
dividual API call frequencies can distinguish between benign and malicious soft-
ware. By focusing solely on the count of each function, we can determine the
effectiveness of this simple yet powerful feature representation in detecting mal-
ware. Subsequent parts of our experiments will build upon this foundation, pro-
gressively incorporating more contextual information to explore how performance varies with sequence length.
By employing this methodical approach, we ensure a comprehensive analysis
of the relationship between API call sequences and malware detection accuracy.
Our ultimate goal is to identify the optimal balance between feature complexity
and detection performance, ultimately developing a robust and efficient malware
detection system that is resilient to common evasion tactics.
For the Unigram approach, we simply map each function call to an array index in the feature vector. We do this by creating a vector V whose length is given in Equation 1.

|V| = |{Call_1, Call_2, ..., Call_i, ..., Call_n}|  (1)

Each dimension in V corresponds to a specific API call, and the value of that dimension is the number of calls to the corresponding function in a particular sample, as shown in Equation 2.

V_i = |Call_i|  (2)
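A minimal sketch of this counting, with three hypothetical function names standing in for the 59 traced ntdll.dll functions:

```python
# Hypothetical stand-ins for the 59 traced ntdll.dll functions.
VOCAB = ["NtCreateFile", "NtOpenProcess", "NtQuerySystemInformation"]
INDEX = {name: i for i, name in enumerate(VOCAB)}

def unigram_vector(calls):
    """Build the count vector V of Equations 1 and 2: V[i] is the number
    of times function i was invoked, regardless of where in the trace
    each call occurred."""
    vec = [0] * len(VOCAB)
    for call in calls:
        vec[INDEX[call]] += 1
    return vec

trace = ["NtOpenProcess", "NtCreateFile", "NtOpenProcess"]
print(unigram_vector(trace))  # [1, 2, 0]
```

Shuffling or reversing the trace leaves the vector unchanged, which is exactly the order invariance the method relies on.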

For the second part, the Bigram model, we also look at the immediately preceding API call. We do this using a sliding window approach over the entire API call sequence. In theory, the length of the feature vector would be the total number of ordered pairs of API calls, which is |V|^2 (59^2 = 3481).
As in the Unigram model, we create a feature vector with one dimension per combination of API calls. Each feature corresponds to two consecutive calls concatenated, like Call_1Call_2, and its value at index i is the total count of that pair, as shown in Equation 3.

V_i = |θ(Call_{i-1}, Call_i)|  (3)

where θ is a mapping from two consecutive API calls to an index. For the third part, we consider the two preceding API calls in the sequence. We follow the same procedure as for the Unigram model; the only differences are the length of the feature vector and the number of consecutive API calls considered.
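The sliding-window counting behind Equation 3 can be sketched as follows. The call names are hypothetical, and joining names with "|" stands in for the mapping θ from consecutive calls to a feature index.

```python
from collections import Counter

def ngram_counts(calls, n):
    """Slide a window of width n over the call sequence and count each
    concatenated n-gram. The n-gram string itself plays the role of the
    mapping theta from consecutive calls to a feature index."""
    windows = zip(*(calls[i:] for i in range(n)))
    return Counter("|".join(w) for w in windows)

trace = ["NtOpenFile", "NtReadFile", "NtOpenFile", "NtReadFile"]
print(ngram_counts(trace, 2))
# Counter({'NtOpenFile|NtReadFile': 2, 'NtReadFile|NtOpenFile': 1})
```

The same function with n=3 yields the trigram counts used in Equation 4 below.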
For the Trigram model, we consider three consecutive calls as given in Equation 4. In this case, the size of the feature vector expands to |V|^3 (59^3 = 205379), which is quite large compared to the feature vectors of the Unigram and Bigram models.
V_i = |θ(Call_{i-2}, Call_{i-1}, Call_i)|  (4)
where θ is a mapping from three consecutive API calls to an index. The in-
dex mapping the unique function sequences to an index is provided as a JSON
file in our online code repository. The final model, which we call the Combined model, concatenates the feature vectors of the Unigram, Bigram, and Trigram models. We do this in the hope that the Combined model exploits the strengths of the Unigram, Bigram, and Trigram models independently, obtaining discriminating information from any of its inputs. The feature vector length for the Unigram model is 59, corresponding to the number of traced ntdll.dll functions, which is manageable, but the Bigram and Trigram models have theoretical lengths of 3481 and 205379. Although the Bigram feature vector is still manageable, a Trigram-based feature vector for more than 330k samples is only feasible on machines with substantial amounts of memory. Moreover, such a feature vector would be sparse, since most values would be zero. Therefore, we identify all the unique bigram and trigram call sequences and create feature vectors using only those present in the dataset. There are 2540 unique bigrams and 5483 unique trigram combinations in the dataset, so for practicality and to save memory, we limit the Bigram and Trigram feature vectors to lengths of 2540 and 5483, respectively. We then train a
random forest on these feature vectors and predict whether a given sample from
the test set is malware or benign. The dataset is unbalanced, containing many
more malicious samples than benign ones.
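The combined feature construction can be sketched as below. This is an illustrative toy, with tiny hypothetical vocabularies standing in for the 59 unigrams, 2540 bigrams, and 5483 trigrams observed in the dataset; the resulting vectors are what a classifier such as a random forest would then be trained on.

```python
from collections import Counter

def ngram_counts(calls, n):
    # Count concatenated n-grams over a sliding window of width n.
    return Counter("|".join(w) for w in zip(*(calls[i:] for i in range(n))))

def combined_vector(calls, uni_vocab, bi_vocab, tri_vocab):
    """Concatenate Unigram, Bigram, and Trigram count vectors, keeping
    only the n-grams actually observed in the dataset (the vocabularies),
    which caps the otherwise huge, sparse feature space."""
    vec = []
    for n, vocab in ((1, uni_vocab), (2, bi_vocab), (3, tri_vocab)):
        counts = ngram_counts(calls, n)
        vec.extend(counts.get(g, 0) for g in vocab)
    return vec

# Hypothetical vocabularies harvested from a toy dataset.
uni = ["NtOpenFile", "NtReadFile"]
bi = ["NtOpenFile|NtReadFile"]
tri = ["NtOpenFile|NtReadFile|NtOpenFile"]
trace = ["NtOpenFile", "NtReadFile", "NtOpenFile"]
print(combined_vector(trace, uni, bi, tri))  # [2, 1, 1, 1]
```

Restricting each block of the vector to observed n-grams is the same memory-saving step described above for the Bigram and Trigram models.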

4 Results

The potential for a malware process to run indefinitely in a real-world setting underscores the critical importance of early detection, providing security researchers with a crucial advantage. Therefore, we run our experiment at different maximum API call counts to determine the count that gives the best balance between early prediction and performance. Our experiment considers a wide range of maximum lengths: 50, 100, 150, 200, 250, 500, 750, 1000, 2500, 5000, 7500, 10000, 20000, and 100000. Running the experiment at different maximum API call counts allows us to determine the minimum number of API calls needed to identify an executable as malicious or benign. We plot this against the F1-Score, the harmonic mean of precision and recall. We do not use accuracy in this case because the dataset is highly unbalanced: simply predicting the majority class yields an accuracy above 96%.
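The arithmetic behind that choice is easy to check. With roughly 330k malware and 10k benign samples (the approximate class balance stated earlier; the exact figures here are illustrative), a degenerate classifier that labels everything as malware already scores high on accuracy while being useless on the benign class:

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

malware, benign = 330_000, 10_000  # approximate dataset class balance

# Predict 'malware' for every sample: every benign sample is missed.
accuracy = malware / (malware + benign)
print(f"accuracy = {accuracy:.3f}")                        # 0.971
print(f"F1 (benign positive) = {f1_score(0, 0, benign)}")  # 0.0
```

The majority-class baseline thus looks excellent on accuracy yet scores zero F1 on the benign class, which is why F1 is the metric reported throughout.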
We see from Figure 2 that the metrics are inferior at the initial maximum of fifty API calls. Remarkably, performance takes a substantial leap at just one hundred API calls. It is clear from this analysis that with only a few hundred API calls
per sample, we can confidently identify the sample’s malicious nature, demon-
strating the power of our analysis without the need for temporal information.
The fifty-nine unique API calls to the ntdll.dll library correspond to the fifty-nine functions we traced. Using a sliding window over these sequences, we find 2540 unique bigrams (two consecutive API calls) and 5483 unique trigrams (three consecutive API calls). An interesting aspect of Figure 2 is the trigram-based model. The
F1-Score drops off after 200 API calls, and this is due to the large number of
unique sequences, which spreads the count out along the feature vector. Due to
the large number of unique occurrences, the number of calls tends to be unique
and the feature vector is a smear of mostly individual unique API calls.

Fig. 2. F1-Score for all our models at different max API call counts. The X-axis is on a logarithmic scale. The results are the average of four different runs.

Metric     Unigram  Bigram  Trigram  Combined
Accuracy   99.24    99.21   98.50    98.50
Precision  91.04    92.47   86.71    90.91
Recall     82.05    79.45   57.12    76.79
F1-Score   86.31    85.46   68.87    83.25
ROC AUC    0.9843   0.9881  0.9495   0.9812

Table 1. Metric values at a maximum of 2500 API calls. We chose 2500 API calls as the maximum limit as it provided the best results across the board.

In our case, we set the benign class to "1" and the malware class to "0". We do
this due to highly imbalanced data favoring the malware class. In a real-world
scenario, we do not want malware samples identified as benign. Therefore, we
need a very precise model so that no malware is identified as benign. Although
having a higher recall is good, in our case, it should not be at the expense of
precision. If our model has false negatives, it means that some benign software
was classified as malware and could warrant a closer look. However, having lower
precision might have drastic consequences. A precision-recall curve can help us
understand the model’s confidence in test data predictions. From Figure 3, we
see that the Unigram model (single function calls, irrespective of order) performs very well, along with the Bigram and Combined models. The exception is the Trigram model, whose precision drops drastically. From this, a Unigram model should be sufficient for distinguishing malware from benign software. All our results and plots are given in our GitHub repository¹⁰.

Fig. 3. Precision Recall curve for the Unigram, Bigram, Trigram, and combined models.
We see that the Trigram model confidence drops very quickly compared to the other
models.

From the different metrics and plots, we see that the Bigram model and the
Unigram model provide the best performance overall. However, the Unigram
model could be favored more simply due to the lower memory requirements and
faster execution.

5 Conclusion

We created a substantial corpus of API calls from recent malware and benign
software, with a specific focus on the ntdll Windows library. Our dataset, the
largest of its kind publicly available, is a valuable resource for the malware
detection research community. Our experiments demonstrate that even a basic
feature such as function call counts can reliably distinguish malicious from
benign software. The resulting models show that, with good feature engineering
techniques, we can detect malware precisely and with negligible performance
overhead. Our method detects malware from a small number of API calls,
demonstrating its efficiency and practicality. Furthermore, we found that for a malware
detection system to be effective, it must analyze at least two hundred and fifty
API calls to the ntdll.dll library. This threshold ensures a reliable level of
detection accuracy, reinforcing the importance of comprehensive data collection in
developing robust malware detection systems. Our findings also highlight the
potential of simple yet effective features in developing efficient and scalable mal-
ware detection solutions, paving the way for future research and advancements
in cybersecurity.

References
1. Aboaoja, F.A., Zainal, A., Ghaleb, F.A., Al-Rimy, B.A.S., Eisa, T.A.E., Elnour,
A.A.H.: Malware detection issues, challenges, and future directions: A survey. Ap-
plied Sciences 12(17), 8482 (2022)
2. Almousa, M., Basavaraju, S., Anwar, M.: Api-based ransomware detection using
machine learning-based threat detection models. In: 2021 18th International Con-
ference on Privacy, Security and Trust (PST). pp. 1–7. IEEE (2021)
3. Bilot, T., El Madhoun, N., Al Agha, K., Zouaoui, A.: A survey on malware detec-
tion with graph representation learning. ACM Computing Surveys 56(11), 1–36
(2024)
4. Catak, F.O., Ahmed, J., Sahinbas, K., Khand, Z.H.: Data augmentation based
malware detection using convolutional neural networks. PeerJ Computer Science
7, e346 (Jan 2021). https://doi.org/10.7717/peerj-cs.346
5. Chappell, G.: Native api functions (2024), https://www.geoffchappell.com/
studies/windows/win32/ntdll/api/native.htm
6. Chen, X., Hao, Z., Li, L., Cui, L., Zhu, Y., Ding, Z., Liu, Y.: Cruparamer: Learning
on parameter-augmented api sequences for malware detection. IEEE Transactions
on Information Forensics and Security 17, 788–803 (2022)
7. Cui, L., Cui, J., Ji, Y., Hao, Z., Li, L., Ding, Z.: Api2vec: Learning representations
of api sequences for malware detection. In: Proceedings of the 32nd ACM SIGSOFT
International Symposium on Software Testing and Analysis. pp. 261–273 (2023)
8. Feng, P., Gai, L., Yang, L., Wang, Q., Li, T., Xi, N., Ma, J.: Dawngnn: Doc-
umentation augmented windows malware detection using graph neural network.
Computers & Security p. 103788 (2024)
9. Gopinath, M., Sethuraman, S.C.: A comprehensive survey on deep learning based
malware detection techniques. Computer Science Review 47, 100529 (2023)
10. Li, C., Lv, Q., Li, N., Wang, Y., Sun, D., Qiao, Y.: A novel deep framework for
dynamic malware detection based on api sequence intrinsic features. Computers &
Security 116, 102686 (2022)
11. Maniriho, P., Mahmood, A.N., Chowdhury, M.J.M.: Api-maldetect: Automated
malware detection framework for windows based on api calls and deep learning
techniques. Journal of Network and Computer Applications 218, 103704 (2023)
12. Microsoft: Inside native applications - sysinternals (2024), https://learn.
microsoft.com/en-us/sysinternals/resources/inside-native-applications
13. de Oliveira, A.S., Sassi, R.J.: Behavioral malware detection using deep graph con-
volutional neural networks. Authorea Preprints (2023)
14. Shukla, S., Kolhe, G., PD, S.M., Rafatirad, S.: Stealthy malware detection using
rnn-based automated localized feature extraction and classifier. In: 2019 IEEE 31st
international conference on tools with artificial intelligence (ICTAI). pp. 590–597.
IEEE (2019)

15. Svitlana Storchak, S.P.: Vergilius project (2024), https://www.vergiliusproject.com/kernels/x64/Windows%2010%20|%202016/2210%2022H2%20(May%202023%20Update)/_EPROCESS
16. Test, A.: Inside the native api (2004), https://web.archive.org/web/
20121224002314/http://netcode.cz/img/83/nativeapi.html
17. Test, A.: Malware statistics & trends report | av-test (2023), https://www.
av-test.org/en/statistics/malware/
18. Tianchi: Aliyun malware detection dataset (2016), https://tianchi.aliyun.com/
dataset/dataDetail?dataId=137262
19. Trinh, Q.: 1.55m api import dataset for malware analysis (2021).
https://doi.org/10.21227/98jc-y909
20. Zhang, Z.: Malware api classification (2022). https://doi.org/10.21227/ngvd-q378
