
2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE)

Graph-Oriented Modelling of Process Event Activity for the Detection of Malware
979-8-3503-2759-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/CSCE60160.2023.00085

* Regular Research Paper

1st Kenneth Brezinski†
Dept. Electrical and Computer Engineering, University of Manitoba
Winnipeg, Canada
[email protected]

2nd Ken Ferens
Dept. Electrical and Computer Engineering, University of Manitoba
Winnipeg, Canada
[email protected]

† Corresponding Author; kbrezinski.github.io

Abstract—This paper presents an approach to malware detection using Graph Neural Networks (GNN) to capture the complex relationships and dependencies between different components of an operating system (OS). Traditional methods for malware detection rely on known signatures of malware and may fail to detect new or modified malware variants. GNNs offer a promising solution by analyzing graph-structured data and identifying malicious behavior patterns. Specifically, this paper investigates the use of GNNs for malware detection based on the API call sequences of different event types, including File System, Registry, and File and Thread activity. The paper presents a representative dataset of host process activity of malware collected in a custom sandbox environment, comprising over 239 malware executions with randomly executed benignware samples. The paper then describes the GNN model trained on the dynamic process behavior generated from process execution graphs, with independent models developed based on each category of API events. Finally, the paper presents a trained model that maximizes the generalization performance of the model, demonstrating the applicability of GNNs for malware detection. This paper presents one of the first applications of GNN classification based on process hierarchy during malware execution that includes interaction with benignware as well.

Index Terms—Malware Detection, Graph Neural Networks, Operating Systems, Application Programming Interface, Artificial Intelligence, Security

I. INTRODUCTION

Malware detection is important due to the devastating consequences that malware can have on individuals, businesses, and governments. For instance, malware attacks can compromise sensitive data, steal financial information, and disrupt critical infrastructure. According to Malwarebytes, 71% of companies worldwide were targeted by some form of ransomware attack in 2022, and malware variants such as Emotet have cost state, local, tribal, and territorial governments up to 1 million USD per incident to remediate [1].

Some of the shortcomings of traditional signature-based and heuristic-based malware detection methods are that they rely on known signatures of malware in order to detect future variants. For instance, signature-based methods may fail to detect new or modified malware variants, while heuristic-based methods may generate many false positives. As malware attacks have become more sophisticated and evasive, this lends credence to the need for a more robust and accurate malware detection approach.

Graph Neural Networks (GNN) offer a promising solution to this problem by combining the ability to process large datasets with ease with discriminative power, all the while generalizing well to unseen samples. GNNs are a type of neural network architecture that have gained popularity in recent years due to their ability to process and analyze graph-structured data, which are increasingly prevalent in various domains, including malware detection [2]. GNNs can be used to leverage the graph structure of an Operating System (OS) to capture the interactions between different system components and identify malicious behavior patterns. OSs are constantly under threat from malware, and traditional methods for malware detection often fail to keep up with the increasing sophistication of malware attacks. GNNs offer a promising approach for detecting and classifying malware based on their behaviors, as they can capture the complex relationships and dependencies between different components of the system.

One such component is the event activity of a process, which can include networking activity, edits to the registry, or the creation of a new file on disk, a new thread, or a Mutex. The motivation for investigating event activity is that the behavior of malware can be best characterized by the independent steps that it carries out, and this is best captured through investigation of the Application Programming Interface (API) sequence [3, 4]. More specifically, this study can reveal behavioral interaction fingerprints that interacting processes leave on the graph structure, which can be used to differentiate and classify different processes. In this paper we explore the use of GNNs for malware detection based on the API call sequences of different event types. Our work makes the following overarching contributions:

• a representative dataset is developed of host OS process activity collected in a custom sandbox environment that comprises over 239 malware executions interleaved with randomly executed benignware samples;


• a GNN model with attention is trained on the dynamic process behavior generated from process execution graphs, and independent models are developed based on File System, Registry, and File and Thread activity;
• a trained model is developed that maximizes the generalization performance of the model, thereby demonstrating the applicability of GNNs for malware detection based on different categories of API events.

We will first provide an overview of GNNs in Section II and their application in various domains including malware detection. In Section III we introduce the sandbox used to collect malware and benignware, along with the API vectorization used and the architecture and design considerations for the GNN. This is followed by Section IV, where the performance of a GNN-based malware detection system on a real-world dataset representing some of the most recent malware variants of concern will be evaluated.

II. RELATED WORKS

Malware attacks continue to pose a significant threat to individuals, businesses, and governments, leading to devastating consequences such as data breaches, financial loss, and disruption of critical infrastructure [5]. Traditional signature-based and heuristic-based malware detection methods have limitations in detecting new and modified malware variants and generating false positives, highlighting the need for more effective approaches to malware detection [6].

GNNs have emerged as a promising solution for detecting malware by leveraging the graph structure of an operating system to capture the interactions between different system components and identify malicious behavior patterns [7]. GNNs can process and analyze graph-structured data with ease, making them well-suited for malware detection tasks. For instance, in [8], the authors proposed a GNN-based approach for malware detection that captures the structural and semantic features of the system call graph to identify malware samples, based on the use of Markov chains and Principal Component Analysis for feature extraction. The results demonstrated that their approach outperformed traditional machine learning-based methods in F1-Score. In total their work included 13,624 samples executed in a Cuckoo sandbox, which only includes isolated samples executed without other processes running; this severely limits applicability, as processes do not run in isolation on host OSs. Similarly, in [9], the authors proposed a GNN-based approach that leverages the attention mechanism to capture the important features of the API call graph from Android APKs. The results showed that their approach achieved high accuracy and outperformed other machine learning-based methods, including a simple Graph Convolution Network (GCN) without attention. This use of attention will be explored in this work as well, as it has been shown that the use of attention in a GNN can dynamically learn the importance of nearby neighbors in a graph framework [10].

Recent studies have shown that the event activity of a process, such as the API sequence, is a critical component for detecting and characterizing malware behavior [11, 12]. In particular, analyzing the API call sequence of different event types can reveal the independent steps carried out by malware and enable more accurate detection and classification. GNNs have been used to effectively capture the behavior of malware based on the API call sequences [8]. The authors of [8] likewise used Cuckoo to generate a directed cyclic graph, weighted according to the proportion of API i calls between nodes n and m. In this fashion a merged graph which satisfies the Markov property was generated and combined with the adjacency matrix of the API calls. In our previous work [13] API call sequences were fed into a transformer whereby the sequences of API calls were learned using self-attention. Self-attention is not to be confused with the attention mechanism described thus far, as self-attention relates to the learning of the sequences by propagating the vectors of the APIs with each other to learn contextual information [14, 13]. In [4], the authors grouped suspicious behavior into 9 categories, ranging from searching for files to infect to distributing virtual memory. The API calls associated with a particular behavior were modeled using finite automata to trace suspicious behavior. An API tracer was used on 914 samples, and following classification a final precision of ≈94% was obtained. Similarly to [8], the work focused on sequences provided by a single sample, as opposed to a robust host environment populated with benignware and malware. It is also the case that the work did not report Recall or F1 score, meaning the model may have very poor coverage of the totality of the malicious class.

Overall, the use of GNNs for malware detection based on the API call sequences of event types has shown great promise in recent years. While the applications of GNNs have been a profound benefit to the field of anomaly detection for malware detection, the datasets provided do not mimic real-life host environments. This is due to the simple fact that tools such as Androguard or Cuckoo are already established frameworks by which API call graphs can be extracted with little setup. This does not provide the dynamism of a host OS, which includes benignware, OS processes, and malware all running simultaneously [13]. For this reason, this work bridges that gap by introducing a representative dataset that incorporates the API call sequences of all processes simultaneously in a single model. This improves on existing work by expanding beyond a single process call graph and investigating the entirety of the API call sequences of the process hierarchy. The next section will discuss some of the innovations made in this respect on the generation of this dataset and the proposed GNN architecture for classification.

III. METHODOLOGY

This section will provide the theoretical basis for the development of a GNN architecture for the process classification of malware, as well as the representative dataset used to model a dynamic host environment complete with benignware and malware process activity.

First, we discuss the sandbox environment in Section III-A and how the malware samples were collected, followed by information on the sampling procedure used to generate a representative dataset in Section III-B. This flows into a discussion of the feature vectorization of the API sequence in Section III-C, followed by the GNN theory and architecture design in Sections III-D and III-G, respectively.

A. Sandbox Process Execution Collection

Malware and benignware samples are collected in a custom sandbox environment so as to simulate a real host environment. Careful attention is taken to ensure the dynamic behavior of malware is captured and cached for downstream tasks. To this end, we capture process event activity belonging to File System, Process and Thread Activity, as well as Registry events. Networking was not included, as networking activity uses similar networking sockets as benignware, and many malware samples, as a part of their anti-emulation procedure, fail to reach out to the network at all. In virtual environments many malware variants become Virtual Machine aware and fail to execute their viral payloads; therefore, the signatures of the malware which probe the current environment for signs of virtualization are as much a part of the malicious payload as the payload itself. Additionally, we wish to capture and detect anomalous behavior before secondary payloads are fetched from the internet, which further complicates the process of labelling processes. For the sake of brevity, we refer readers to our previous work in [13], which provides a complete overview of the sandbox environment and configuration.
B. Malware Sampling and Run-time Configuration

The malware samples used for execution were drawn from a repository of recent malware samples obtained from VirusTotal¹. VirusTotal provides researchers a repository of tens of thousands of malware samples identified and bundled in the last quarter year to represent new and unique infections submitted to VirusTotal through an Academic License. In this work 239 malware samples were retrieved from Q4 2022. This dataset of malware samples was filtered to remove non-Windows malware, and includes both 32-bit and 64-bit executables. A full list of malicious executables, complete with file sizes and MD5 hashes, can be found in the Github repository for this work.

¹ www.virustotal.com/

Alongside malware, benignware is executed in tandem so as to simulate a real host environment. During malware execution, 3 to 5 processes are randomly selected from the benignware list and run sequentially. This is to populate the execution graph with noise and negative training samples, and ensures the dataset mirrors the OS environment of a real host as closely as possible. In total, 300 benignware samples are collected from the cnet.com Apps for Windows category, representing popular Windows applications. All benignware was confirmed to have an AV score of 0 according to VirusTotal.

A systematic approach was used to stochastically sample negative training examples from the benignware category of executables. First, consider the set of malware samples M and benignware samples C. If each malware executable were to be executed once, then M total trials or executable graphs are being populated with process activity. Let c_m represent the subset of benignware executables c_m ⊂ C for execution m. At each trial m, the chance of drawing a particular benignware sample is (1/|C|)E[p_c] with replacement, where E[p_c] is the expectation of drawing samples with probability p_c. Therefore, for M trials the chance of not drawing a particular benign sample is governed by Eq. 1, where the |C| − i term takes into account the fact that replacement cannot occur, as no executable can be executed twice on a clean snapshot.

M ∏_{i=1}^{E[p_c]} ( 1 − 1/(|C| − i) )    (1)

This exercise simply demonstrates that some benignware samples are never used to train the model under this sampling technique. This provides a good generalization over potential host environments, without the bias that would be introduced if the same processes were chosen manually for each malware execution graph.
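This effect is straightforward to simulate. The following is a minimal sketch under the sampling scheme described above, with |C| = 300 benignware samples, M = 239 executions, and 3 to 5 draws per trial; the variable names are illustrative, not from the collection pipeline itself:

import random

C = list(range(300))   # benignware pool, |C| = 300
M = 239                # one trial per malware execution

never_drawn = set(C)
for _ in range(M):
    # 3 to 5 benignware samples accompany each malware execution,
    # drawn without replacement within a trial (a clean snapshot means
    # no executable runs twice in the same graph)
    drawn = random.sample(C, random.randint(3, 5))
    never_drawn.difference_update(drawn)

print(f"{len(never_drawn) / len(C):.1%} of benignware never executed")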
In the sandbox environment the samples were automatically run through the use of a batch script, which helps to automate the process and provide consistent executions between malware samples in terms of time window. A short script is used to run Procmon, load the relevant configuration files and filters, and begin collection. The filter files (.pmc files) are used to exclude some Procmon-specific events from appearing in the list of captured events, but all other events, including Windows operating system behaviour, are captured. This ensures there is no bias introduced in the collected event activity by the author. A list of the rules used in the Procmon filter can be found in Table IV in Appendix B.
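The collection script itself is not reproduced in the paper; a minimal Python stand-in might look like the following sketch, assuming Process Monitor's documented command-line switches (/AcceptEula, /LoadConfig, /BackingFile, /Runtime, /Quiet, /Minimized) and placeholder file names:

import subprocess

def collect(sample: str, runtime_s: int = 120) -> None:
    # start event capture with the disallow filters of Table IV applied
    procmon = subprocess.Popen([
        "Procmon.exe", "/AcceptEula", "/Quiet", "/Minimized",
        "/LoadConfig", "filters.pmc",        # hypothetical .pmc filter file
        "/BackingFile", f"{sample}.pml",     # hypothetical event log output
        "/Runtime", str(runtime_s),          # consistent time window per run
    ])
    subprocess.Popen([sample])               # execute the sample itself
    procmon.wait()                           # capture stops after runtime_s

collect("malware_sample.exe")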

C. API Call Sequence Vectorization

In this work process APIs were vectorized according to a simple scheme combining N-grams with tf-idf. To begin, N-grams looks at all the unique length-n sequences of APIs and creates a feature vector by enumerating the unique sequences. N-grams improves on Bag of Words by accounting for pairs, in the case of bigrams, or triples, in the case of trigrams, of unique API sequences of length n. These sequences are taken from the stack traces, where they appear in order from low-memory space to high-memory space (0xfffff) in the stack. In Fig. 1 we see an illustration of a stack trace that has been resolved using the Windows Symbol Table to produce the names of the API calls [3]. In this example LoadLibraryExa and LoadLibraryA appear after one another, therefore the tuple (LoadLibraryExa, LoadLibraryA) is added to the corpus. Then the next element in the sequence would include BaseThreadInitThunk, which would be added to the corpus with the previous element as (LoadLibraryA, BaseThreadInitThunk). This captures all unique combinations of Windows APIs, and maintains some of the order in the sequences.
When considering trigrams we would incorporate more information of the sequence by considering all combinations of three (n = 3) APIs in succession.

Fig. 1. Translation from memory locations of imported DLLs to Windows API function calls. Retrieved from [3].

The decision to select a larger n comes down to data sparsity. In a previous work the number of Windows APIs used by event processes was noted to be in the range of 1500-1700 [3], which is a relatively low number compared to typical Natural Language Processing (NLP) problems. Larger values of n mean larger and larger sequences which will occur less frequently, meaning the model may overfit on the training data and generalize poorly. Conversely, a smaller n will lose the ability to trace longer sequences and will be unable to learn the importance of classifying maliciousness, thereby causing the model to underfit.
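As a minimal sketch of the bigram extraction described above (the API names are taken from the Fig. 1 example; the helper function is illustrative):

from collections import Counter

def ngrams(trace, n=2):
    # slide a window of length n over the ordered API call trace
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

# stack trace resolved to API names, low- to high-memory order
trace = ["LoadLibraryExa", "LoadLibraryA", "BaseThreadInitThunk"]

corpus = Counter(ngrams(trace, n=2))
# {('LoadLibraryExa', 'LoadLibraryA'): 1,
#  ('LoadLibraryA', 'BaseThreadInitThunk'): 1}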


Term Frequency-Inverse Document Frequency (tf-idf) takes the vectorization one step further by attempting to learn the importance of words by considering their inverse frequency of occurrence. The motivation for this technique is that more frequent APIs are less important in distinguishing maliciousness, while greater attention should be paid towards less frequent APIs. Theoretically this technique would be more powerful in attending to rarely used APIs that malware will use but benignware does not, while at the same time disregarding commonly used APIs that are routine to the functionality of any running process. Calculating tf-idf is a straightforward process and can be done according to Eq. 2.

idf_td = tf_td × log( N / (df_t + 1) )    (2)

Where tf_td accounts for the proportion of the API, t, used by the process d by dividing the count of t by the total number of APIs used by d. df_t keeps track of the number of occurrences of t across all N processes. The logarithm has the effect of dampening the explosion of the numerator for high values of N. When df_t gets larger for more frequent APIs, the term in the logarithm approaches 1 and the idf_td is small. When df_t is small for rarely used APIs, the idf_td becomes larger and is given a larger weight. Through this preprocessing step APIs can be vectorized in such a way that malicious API usage can have a greater impact on process classification.
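A minimal implementation of Eq. 2, written out directly rather than through a library such as scikit-learn (whose smoothing conventions differ); the counts below are hypothetical:

import math

def tfidf(count_td: int, total_d: int, df_t: int, N: int) -> float:
    tf = count_td / total_d           # proportion of d's calls that are t
    idf = math.log(N / (df_t + 1))    # rare APIs receive a larger weight
    return tf * idf

# an API used by nearly every one of N = 239 processes is down-weighted
print(tfidf(count_td=50, total_d=100, df_t=238, N=239))  # 0.0
# the same usage of an API seen in only one process is weighted up
print(tfidf(count_td=50, total_d=100, df_t=1, N=239))    # ~2.39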
D. Introduction to Graph Neural Networks

Node feature vectors are prepared from a directional flow graph G. We define V as the set of vertices on the graph, which represent spawned processes on a host OS. Each vertex has a set of node features, h = {h̄_1, h̄_2, ..., h̄_N}, h̄_i ∈ R^F, where N is the number of nodes in the graph and F is the number of features for each node. In Fig. 2 we observe four nodes that are neighbors. For example, if h̄_2 is the node under consideration, it shares a neighbour with h̄_1 and h̄_3 but not with h̄_4. Therefore, for any ith node we want to consider the influence of its neighboring nodes j ∈ N_i.

Fig. 2. Graph structure of a 4 node cluster, complete with attention coefficients α_ij between any given nodes i and j.

In Fig. 2 we also observe the coefficient a_ij, where i, j correspond to the indices of neighboring vertices, which acts as an attention mechanism, known as the attention coefficient, and dictates the amount of influence to be paid between neighboring nodes. In Section III-F we will discuss the role these coefficients play and how they are computed. For now we will explore how we can make use of our feature vectors through the use of linear projections.
E. Projecting Higher-Order Features

To create a higher-order representation of our features we apply a simple linear transformation. In Fig. 3 the feature matrix, h, is created using a simple Bag-of-Words (BoW) feature representation by enumerating whether or not a node used a particular Windows API. This representation is used just for demonstrative purposes. If a process uses the particular API call it is incremented by 1; otherwise it is given the default value of 0, meaning it was not used. Using a weight matrix W ∈ R^{n×E}, where E is the embedding dimension and n is the number of features, we can compute the embeddings for node j using the dot product between W and h̄_j, producing Wh̄_j. These embeddings are learned and provide the model the ability to learn which features, in this case Windows APIs, are important in considering the maliciousness of nodes.

Fig. 3. Creation of the node embeddings via the dot product between the BoW API matrix and the weight matrix W.
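A minimal PyTorch sketch of this projection follows. The paper writes W ∈ R^{n×E}, but applying it as Wh̄_j implies the transpose, which is the convention used here; the sizes follow the 1-gram "All" corpus of Table I and the chosen API indices are hypothetical:

import torch

n, E = 2921, 64              # corpus size (Table I) and embedding dimension
W = torch.randn(E, n, requires_grad=True)   # learnable weight matrix
h_j = torch.zeros(n)         # BoW vector: 1 where node j used a given API
h_j[[3, 17, 42]] = 1.0       # hypothetical APIs used by this process

embedding = W @ h_j          # Wh̄_j: a learned E-dimensional embedding
print(embedding.shape)       # torch.Size([64])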

F. Attention Coefficient

The attention coefficient combines the influence of nearby nodes and allows every node to attend to every other node on the graph framework. The expression for calculating a_ij between nodes i and j is shown in Eq. 3, where w̄ᵀ is a trainable weight vector that is transposed, ‖ signifies the concatenation (double pipe) operator which stacks the transformed feature vectors Wh̄ along the second axis, and LeakyReLU() applies the leaky ReLU activation function.

a_ij = exp( LeakyReLU( w̄ᵀ[ Wh̄_i ‖ Wh̄_j ] ) ) / Σ_{k∈N_i} exp( LeakyReLU( w̄ᵀ[ Wh̄_i ‖ Wh̄_k ] ) )    (3)

The use of LeakyReLU has the effect of prioritizing the positive influences of the node feature vectors, while allowing a weight to recover if it were driven to less than 0, where the derivative evaluates to 0 in a normal ReLU. Additionally, the use of exp() in the numerator and denominator implements a softmax, which ensures the attention coefficients a_ij ∈ [0, 1] are normalized; the coefficients sum to 1 when considering all possible neighboring nodes N_i for node i. When we compute a_ij for all neighboring nodes, we can then proceed to calculate the transformed feature representation according to Eq. 4.

h̄′_i = σ( Σ_{j∈N_i} a_ij Wh̄_j )    (4)

Therefore all neighbors of i, j ∈ N_i, are considered through their respective a_ij values, with a final sigmoid being used to introduce non-linearity. A less verbose explanation and the original derivation of GATs can be found in the original work [10].
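The following is a compact, single-head sketch of Eqs. 3 and 4 for one node, using a dense boolean adjacency; in practice a library layer such as PyTorch Geometric's GATConv implements the same computation in batched form, and the tensors below are randomly initialized stand-ins:

import torch
import torch.nn.functional as F

def gat_update(h, W, w, adj, i):
    # h: (N, F) features; W: (E, F) projection; w: (2E,) attention vector
    z = h @ W.T                                  # Wh̄_j for every node
    nbrs = adj[i].nonzero().squeeze(1)           # indices j in N_i
    # e_ij = LeakyReLU(w̄ᵀ[Wh̄_i ‖ Wh̄_j]) for each neighbour j
    pairs = torch.cat([z[i].expand(len(nbrs), -1), z[nbrs]], dim=1)
    e = F.leaky_relu(pairs @ w, negative_slope=0.2)
    a = torch.softmax(e, dim=0)                  # Eq. 3: normalize over N_i
    return torch.sigmoid(a @ z[nbrs])            # Eq. 4: σ(Σ_j a_ij Wh̄_j)

h = torch.randn(4, 8)                            # the 4-node cluster of Fig. 2
W, w = torch.randn(16, 8), torch.randn(32)
adj = torch.tensor([[0, 1, 1, 0], [1, 0, 1, 0],
                    [1, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.bool)
print(gat_update(h, W, w, adj, i=1).shape)       # torch.Size([16])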
G. Graph Neural Network Architecture

Understanding the topology of the process execution is important as a preliminary step for several reasons. First, process graph topology is fundamentally different than that of a social network or a citation network, and knowledge of the topology leads to better decision making in the architecture design. For example, if processes on average spawn 1 to 2 daughter processes, then creating a deeper GNN based on 3-hop neighbours (a 3-layer GNN) would be redundant and lead to over-saturation, or a smoothing effect over the embedding space. As an example of this impact, let us denote any ith process p_i^j, where j refers to the hierarchy of the process, or depth if one were to consider some form of n-ary tree. If a user uses explorer.exe, denoted as p_i^1, to execute a malware executable p_i^2, then we can use GNNs to share the event behaviour of any process p_i^2 → p_i^1 via message passing. However, the OS parent process that spawned explorer.exe, let us call it p^0, which acts as a root node, would not be helpful to understand the behaviour of p_i^2, since there are many processes that are benign which share an ancestor with p_i^2. As a result, all p_i^2 processes that exist would share the same embedding information from p^0, leading to all the process nodes p_n^2 becoming saturated with similar information. This would impact performance negatively, and is best resolved through domain knowledge of the dataset used and carefully considering the amount of capacity of your model. A visualization of this effect is shown in Fig. 4. We can see the difference between a 1-hop aggregation (Fig. 4, left) and a 2-hop aggregation (Fig. 4, right), whereby the malicious red node p_1^2 is smoothed over with information from p^0, leading to a node embedding that is more similar to that of its benign counterpart p_2^2. This is illustrated by the proportion of red, orange and green in its node embedding, which geometrically would coincide with the vectors p_1^2 and p_2^2 being closer or farther apart in d-dimensional space.

Fig. 4. Process hierarchy for a directed graph, where the red, orange and green coloured nodes refer to malicious, suspicious and benign processes, respectively. Crude illustration of node embeddings after message passing and aggregation for (left) 1-hop neighbourhood and (right) 2-hop neighbourhood.

Additionally, for each feature set belonging to each event type, a corpus is created for the API calls based on the discussion in Section III-C. In Table I we can view the size of each corpus, which is the size of the node feature vector being fed into the model. Additionally, Table I provides the corpus sizes for 1-, 2- and 3-grams, representing all unique combinations of n APIs in sequence. What is noteworthy is that the corpus sizes are much larger than in previous related work [3], which recorded corpus sizes in the 1500-1700 range. The previous work only looked at the stack traces of 9 malware executables, while this work looks at over 200, meaning we are covering a more recent swathe of potential threats and behaviours in our dataset. The previous work additionally did not look at benignware executables, which further exemplifies the novelty and the necessity of this work. Looking at longer sequences has the benefit of encoding more useful information about the behaviour of the process via long-range dependency, at the cost of introducing noise and added dimensionality and complexity. The corpus sizes in Table I directly relate to the size of the initial input dimensions in the GNN architecture.

In a GNN architecture the model needs to learn from various topologies in order to learn the contexts of malicious and benign node frameworks. For this reason the model will be trained in a transductive setting, where nodes are batched into training and validation sets (80/20 split) as shown in Fig. 5.

For a given topology the model will be fed with a list of edges, E_{2×|E|}, derived from an adjacency list connecting source nodes to destination nodes using PIDs and PPIDs, respectively. Based on the aforementioned considerations for GNN design discussed in this section, a summary of GNN model parameters can be found in Code Listing 1 in Appendix A, with additional details on the training configuration.
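A minimal sketch of this edge construction follows; the PID/PPID pairs are illustrative:

import torch

# (PID, PPID) pairs from the captured process events; values are made up
events = [(4012, 1200), (4100, 4012), (4104, 4012), (4230, 4100)]

pids = sorted({p for pair in events for p in pair})
node_id = {pid: i for i, pid in enumerate(pids)}

# E_{2×|E|}: row 0 holds source (parent) nodes, row 1 destination (child) nodes
edge_index = torch.tensor(
    [[node_id[ppid] for _, ppid in events],
     [node_id[pid] for pid, _ in events]])
print(edge_index.shape)   # torch.Size([2, 4])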

TABLE I
SUMMARY TABLE OF THE NUMBER OF API CALLS BELONGING TO THE CORPUS FOR DIFFERENT n-GRAMS. THE EVENT TYPE All REFERS TO A COMBINATION OF REGISTRY, FILE AND THREAD EVENT APIS INTO A SINGLE CORPUS.

Event      1-gram   2-gram   3-gram
Registry     1917     6517    10985
File         2017     6018     9669
Thread        612     1506     2200
All          2921    10656    18267

Fig. 5. Training (left) and Validation (right) sets for nodes randomly selected at each iteration.
IV. RESULTS AND DISCUSSION

Based on the aforementioned design considerations outlined in the previous section, an overview of the malware topologies is briefly covered in Section IV-A. This is followed by the main performance metrics compared for various datasets and model architectures in Section IV-B, followed by some final discussions on model variance in Section IV-C.

A. Node Topology and Feature Characteristics

This section briefly summarizes the structure and attributes of the sandbox execution graphs, first through a discussion of network metrics as they pertain to the topology of the sandbox malware execution graphs. A table summarizing the network metrics is shown in Table II. A summary of these network statistics and their meaning follows. Many of these network statistics play a role in the architecture design of the GNN, as understanding the network is a precursor to establishing effective message passing layers.

TABLE II
SUMMARY NETWORK STATISTICS FOR THE SANDBOX MALWARE EXECUTION GRAPHS. VALUES ARE AVERAGED OVER ALL MALWARE EXECUTIONS (n = 239).

Metric                      Value
# Nodes                     40
# Edges                     112
Avg. Node Degree            6.70
Min Node Degree             2.33
Max. Node Degree            11
Avg. Degree Connectivity    4.75
Degree Centrality           0.14
Degree Assortativity        -0.20
Density                     0.07
Square Clustering           0.32

a) Avg. Node Degree: This is the average degree of a node in the network. The degree of a node is the number of edges connected to it. In this case, the average node has a degree of 6.70, meaning the process topology is highly interconnected.

b) Min Node Degree: This is the minimum degree of a node in the network. In this case, the minimum degree is 2.33.

c) Max. Node Degree: This is the maximum degree of a node in the network. In this case, the maximum degree is 11. This node would belong to a process such as cmd.exe or explorer.exe in the case of an execution initiated by the user using the OS Graphical User Interface (GUI).

d) Avg. Degree Connectivity: This is the average degree connectivity of a node in the network. Degree connectivity is a measure of how well connected a node is to its neighbors. In this case, the average degree connectivity is 4.75. The degree connectivity of a node is calculated as the number of neighbors of the node with degree greater than or equal to the degree of the node, divided by the total number of neighbors of the node. For example, if a node has four neighbors with degree 2, two neighbors with degree 3, and one neighbor with degree 4, its degree connectivity would be 7/7 = 1. This is in contrast to Avg. Node Degree, which is simply the sum of the degrees of all nodes in the network, divided by the number of nodes.

e) Degree Centrality: This is a measure of the importance of a node in the network based on its degree. In this case, the degree centrality is 0.14. Degree centrality can be calculated as the degree of a node divided by the maximum possible degree in the network. For example, in a network with 10 nodes, the maximum possible degree is 9 (since a node cannot be connected to itself). If a node has a degree of 6, its degree centrality would be 6/9 = 0.67. Given the high inter-connectivity of the graph, a value of 0.14 is fairly high.

f) Degree Assortativity: This is a measure of the degree homophily in the network. Degree homophily is the tendency of nodes to connect to other nodes with similar degree. In this case, the degree assortativity is -0.20, which means that nodes with high degree are more likely to connect to nodes with low degree, and vice versa. This value would be larger, or even positive, in the case of a larger network topology where servers are communicating with each other to form the backbone of the internet. In the case of process execution, there is no such behaviour to be observed, as core OS behaviour tends to spawn daughter processes, but not other core OS processes.

g) Density: This is a measure of the number of edges in the network relative to the maximum number of edges that could exist. In this case, the density is 0.07, which means that there are relatively few edges in the network. Once process execution is carried out, it does not make intuitive sense for a daughter process to spawn a process already running. It is even the case that, through the use of mutexes, processes are not allowed to execute once already running.

h) Square Clustering: This is a measure of the clustering coefficient of the network. While the standard clustering coefficient measures the number of triangles in the network, square clustering counts cycles of length four. In this case, the square clustering is 0.32, which means that there is a relatively high level of clustering in the network. This follows from the Degree Connectivity metric that noted the high inter-connectivity of the network.

B. Graph Neural Network Performance on Sandbox Execution Graphs

TABLE III
SUMMARY MODEL PERFORMANCE METRICS FOR THE VALIDATION SET. METRICS ARE COMPUTED OVER 20 ITERATIONS. ELEMENTS BOLDED SIGNIFY THE BEST PERFORMING MODEL FOR EACH DATASET. NAMING SCHEME model-l-d REFERS TO THE model (GAT - GRAPH ATTENTION; GCN - GRAPH CONVOLUTION W/O ATTENTION; ANN - LINEAR LAYER) WITH l LAYERS AND d HIDDEN NEURONS IN EACH LAYER. UNLESS OTHERWISE NOTED, THE MODELS USE THE FILE SYSTEM EVENT TYPE WITH A unigram VECTORIZATION WITH l = 2 AND d = 64.

Model          Loss ↓        Accuracy (%) ↑   F1 (10⁻²) ↑
GAT-File       1.05 ± 0.4    95.87 ± 1.9      93.64 ± 2.2
GAT-Registry   1.19 ± 0.7    94.85 ± 2.2      91.81 ± 2.6
GAT-Thread     1.84 ± 0.3    95.22 ± 2.2      92.72 ± 2.0
GAT-All        5.02 ± 0.3    75.23 ± 2.0      93.29 ± 2.2
unigram        1.14 ± 0.5    96.13 ± 1.0      96.39 ± 0.1
bigram         1.31 ± 0.3    86.96 ± 1.0      95.15 ± 1.1
trigram        1.39 ± 0.2    84.91 ± 1.3      95.55 ± 1.2
GAT-2-32       1.27 ± 0.3    94.62 ± 1.0      94.91 ± 0.4
GAT-2-64       1.14 ± 0.3    96.13 ± 1.6      96.39 ± 1.6
GAT-2-128      1.05 ± 0.3    96.00 ± 2.5      90.92 ± 2.2
GAT-3-64       1.40 ± 0.4    94.53 ± 1.1      95.15 ± 1.2
GCN-2-64       1.60 ± 0.4    94.76 ± 1.0      89.98 ± 1.0
ANN-2-64       1.88 ± 0.3    90.44 ± 0.8      82.98 ± 1.0
benignware. This can be due to several reasons, but mainly the
creation and deletion of files on disk occurs with an irregularity
that isn’t the case for benignware as malware attempts to cover of certain sequences penalizes the model more than it benefits
it tracks on the OS. Its also the case that for Registry events based on these results.
certain querying and alteration of registry keys is indicative of Finally, an investigation on different model depths and
persistence which occurs for both benignware and malware as capacities was carried out. This included investigating 2-hop
many processes look to set registry keys upon startup. It is also and 3-hop neighborhoods (noted using the value of l for the
the case that other event type usage are too similar between number of hops), as well as different hidden neurons in each
malware vs benignware, albeit by a small margin as the results layer, d. Table III demonstrates 128 hidden neurons does lead
all show comparative performance within error. This is the case to better loss but not necessarily an improvement in metric
with the sole exception of GAT-All which combines all the score that is statistically significant. This is because the model
APIs from all event types. The GNN model loses the ability to converges during training and the model does not learn anything
learn the relationships between processes when all event types new with additional capacity. In addition, 2-hop and 3-hop
are used as it attempts to learn a general representation of neighborhoods were tested (GAT-2-64 and GAT-3-64), and
behavior as opposed to one focused on a single event type. It a 3 layer GNN (l = 3) noted worse performance in loss
is also the case that a larger feature representations are harder and accuracy (1.40/94.53 vs. 1.14/96.13, respectively). As
to learn, and may require running for more epochs and model previously discussed in Section III-D, deeper GNNs leads to
fine-tuning. over-saturation of the embedding space as malicious processes
Additionally, unigram was tested and compared with both a incorporate information of their benign counterparts. For
bigram and trigram model. The unigram feature representative comparison purposes and as a baseline, we compared results
was optimal by a large margin (96.39 vs 86.96 and 84.91 in for a simple artificial neural network (ANN-2-64) which
Accuracy), and this can mainly be attributed to over-fitting. contains no message passing aggregation along with a GNN
While longer sequences do encode more information about the without attention (GCN-2-64). For both types the model had
behavior of the malware through a sequence of APIs, it has diminishing performances as the message passing is pertinent
the side effect of relying too much on certain sequences and in the training of embeddings for a GNN which the ANN does
generalizing poorly to the validation set. Better regularization not have; and the attention mechanism can learn the importance
is one technique which can minimize the effect of over- of nearby neighbors which a simple graph convolution does
fitting, which in this work was already implemented with not have.
L2 regularization. Also, a larger feature representation is also
C. Discussion on Model Variance
more difficult to train and has similar shortfalls as mentioned
previously with the GAT-All performance. Aside from the One takeaway from the model performances is that models
issue with over-fitting, a unigram model may be optimal in have larger variances between model runs. This is evident in
cases where the single use of APIs is salient enough such that the high standard deviations in model loss, accuracies and F1
the model learns malicious behavior based on their usage and Scores shown thus far in this section. The reason for this is due
frequency of usage. Since tf-idf was used, rarely used APIs are to the sampling procedure used and some of the considerations
given a greater weight and for bigram and trigram the rarity touched on in Section III-B. Some of the process execution

Some of the process execution graphs are populated with benignware activity that is executed in two different graphs, while some samples are not present at all. This has the effect of the model learning very different execution graphs, which adds to the robustness of the model but also leads to large deviations between iterations. Secondly, the largest source of model variance is the selection of the training and validation set. To overcome effects due to sampling bias (i.e. the manual selection of samples to validate on, which introduces bias), the positive and negative training examples that fall into the validation set are randomly selected for each iteration that a new model is trained. These iterations are different than epochs: at each iteration a new model is initialized with random weights, a randomized training and validation set, and a new random seed, whereas during each epoch the model is trained and validated on the same model and training and validation set. So for each iteration the model experience is very different, and this can lead to significant model variance. This is because, depending on the iteration, if a hard-to-classify malicious executable is presented in the validation set, it would not have been trained on in the training set, and would thus be hard to classify. In another iteration it would be present in the training set, in which case it would not belong to the validation set and the performance scores would improve. Repeated iterations tend to smooth over this effect as it averages out, but this presents a persistent effect due to the nature of sampling over 200 independent process execution graphs with heterogeneous topologies. This work already applies Regularization and Early Stopping to ensure the validation performance is as close as possible to the training performance to avoid over-fitting [15, 16]. One other remedy is to increase the size of the dataset, which is always a solution to problems with over-fitting leading to poor generalization in deep learning research. But largely due to the time commitment, and in other cases the cost of manually labelling training examples, this is infeasible and impractical in practice [17, 18, 19].

V. ACKNOWLEDGEMENTS

This research has been financially supported by Mitacs Accelerate (IT15018) in partnership with Canadian Tire Corporation, and is supported by the University of Manitoba.
REFERENCES

[1] Malwarebytes. 2023 State of Malware. Technical report, Santa Clara, CA, 2023.
[2] Yakang Hua, Yuanzheng Du, and Dongzhi He. Classifying Packed Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network. In 2020 International Conference on Computer Engineering and Application (ICCEA), pages 254–258, March 2020.
[3] Kenneth Brezinski and Ken Ferens. Transformers - Malware in Disguise. In Transactions on Computational Science & Computational Intelligence. Springer Nature, ACC4507; Accepted, In Press, 2021.
[4] Cheng Wang, Jianmin Pang, Rongcai Zhao, Wen Fu, and Xiaoxian Liu. Malware Detection Based on Suspicious Behavior Identification. In Proceedings of the 2009 First International Workshop on Education Technology and Computer Science - Volume 02, ETCS '09, pages 198–202, USA, March 2009. IEEE Computer Society.
[5] Kenneth Brezinski and Ken Ferens. Metamorphic Malware and Obfuscation - A Survey of Techniques, Variants and Generation Kits. Security and Communication Networks, 2023.
[6] Daniel Gibert, Carles Mateu, and Jordi Planes. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. Journal of Network and Computer Applications, 153:102526, March 2020.
[7] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, January 2020.
[8] Shanxi Li, Qingguo Zhou, Rui Zhou, and Qingquan Lv. Intelligent malware detection based on graph convolutional network. J Supercomput, 78(3):4182–4198, 2022.
[9] Cagatay Catal, Hakan Gunduz, and Alper Ozcan. Malware Detection Based on Graph Attention Networks for Intelligent Transportation Systems. Electronics, 10(20):2534, January 2021.
[10] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. arXiv:1710.10903 [cs, stat], February 2018.
[11] Tristan Bilot, Nour El Madhoun, Khaldoun Al Agha, and Anis Zouaoui. A Survey on Malware Detection with Graph Representation Learning, March 2023. arXiv:2303.16004 [cs].
[12] Youngjoon Ki, Eunjin Kim, and Huy Kang Kim. A Novel Approach to Detect Malware Based on API Call Sequence Analysis. International Journal of Distributed Sensor Networks, 11(6):659101, June 2015.
[13] Kenneth Brezinski and Ken Ferens. Sandy Toolbox: A Framework for Dynamic Malware Analysis and Model Development. In Transactions on Computational Science & Computational Intelligence. Springer Nature, SAM4213; Accepted, In Press, 2021.
[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[15] Hwanjun Song, Minseok Kim, Dongmin Park, and Jae-Gil Lee. How does Early Stopping Help Generalization against Label Noise?, September 2020. arXiv:1911.08059 [cs, stat].
[16] Rich Caruana, Steve Lawrence, and C. Giles. Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping. In Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000.

[17] Xue-Wen Chen and Xiaotong Lin. Big Data Deep Learning: Challenges and Perspectives. IEEE Access, 2:514–525, 2014.
[18] Douglas Heaven. Why deep-learning AIs are so easy to fool. Nature, 574(7777):163–166, October 2019.
[19] John D. Kelleher. Deep Learning. MIT Press, September 2019.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs], December 2015.
[21] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, March 2010.
[22] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[23] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], January 2017.
[24] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. arXiv:1711.05101 [cs, math], January 2019.
APPENDIX A - GRAPH NEURAL NETWORK ARCHITECTURE AND TRAINING CONFIGURATION
Model Weights and Initialization

Weight matrices were all initialized using He initialization [20, 21], which randomly initializes the weights (W^l_{ij} ∼ N(μ = 0, σ² = 0.01)) in a range determined by the number of input and output units in the layer. Additionally, Dropout is used to regularize the network and prevent the model from over-fitting [22]. Unlike in [10], where the authors implemented dropout of the original feature vector h̄_j, in this work the sparsity of the vectorized API calls makes it so that dropout would have a minimal effect. Dropout is instead applied after the linear projection and after determining the attention coefficients, as is typical in a deep neural network to regulate over-fitting and prevent the network from relying on any particular set of weights. So for each forward propagation and at each layer l, a binary mask r^l_j is drawn for each input unit j, where r^l_j ∼ Bernoulli(p). Each input layer is then masked as x̃^l := x^l ⊙ r^l_j to produce an output with a fraction p of units set to 0. The dropout probability for this work was tested at p ∈ {0.2, 0.5}. Since the model learns with a proportion p of units set to 0, during inference the model output needs to be scaled by a proportional amount, x̃^l := x^l ⊙ (1 − p), to account for the scaled-back activations the model is accustomed to during training.
the vectorized API calls makes it so that dropout would have i=1
a minimal effect. Dropout is however applied after the linear b) Accuracy: in this work is calculated as the macro-
projection and after determining the attentions coefficients, as average of the accuracies whereby the arithmetic mean of each
this is typical in a deep neural network to regulate over-fitting individual class is taken into account. Therefore, for C classes,
and prevent the network from relying on any particular set of the macro-average accuracy is calculated according to Eq. 6.
weights. So for each forward propagation and at each layer, The expression for Eq. 6 has the advantage of calculating the
a binary mask is set for each input unit j, which is drawn accuracy for each class independently, meaning class-imbalance
with probability p where rjl ∼ Bernoulli(p) at each layer l. is accounted for in the metric according the number of training
Each input layer is then masked with rjl where x̃l := xl  rjl to examples for each class Nc .
produce an output with fraction p of layers set to 0. Dropout
probability for this work was tested at p ∈ {0.2, 0.5}. Since
c) F1-Score: is a performance metric, where F1-Score ∈ (0, 1), that is used to evaluate the accuracy of a classifier. It is defined as the harmonic mean of precision and recall, where precision is the fraction of True Positive predictions among all positive predictions, and recall is the fraction of True Positive (TP) predictions among all actual positive instances. F1-Score can be calculated according to Eq. 7, where precision = TP/(TP + FP) and recall = TP/(TP + FN); FN is False Negative and FP is False Positive. This metric accounts for class imbalance, providing a meaningful indicator of model performance on both the malicious and benign training examples.

F1-Score = 2 × ( precision × recall ) / ( precision + recall )    (7)
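A minimal implementation of Eq. 7 from raw confusion counts (the counts below are hypothetical):

def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)      # TP among all positive predictions
    recall = tp / (tp + fn)         # TP among all actual positives
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=90, fp=5, fn=10))   # ~0.923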
Training Configuration

The Adam optimizer was used as the training optimizer, which combines stochastic gradient descent with momentum with RMSProp [23]. Readers can refer to the original paper for the procedure, or the PyTorch documentation for an overview of the algorithm. The optimizer's β parameters were set to (0.9, 0.999), with a learning rate η ∈ {5 × 10⁻², 1 × 10⁻², 5 × 10⁻³}. L2 regularization was also implemented through weight decay λ at 10% of the learning rate [24]. So for an η of 5 × 10⁻³ the decay rate would be 5 × 10⁻⁴. In the Adam optimizer, the decay rate is multiplied by the weights w_t at iteration t, which penalizes the magnitude of the weights at each epoch. Models were run for 500 epochs each, where one epoch is equivalent to one pass through each of the execution graphs. For statistical analysis, models were trained for 20 iterations, where one iteration corresponds to re-initialization of the model parameters and optimizer, and randomization of the training/validation split.
analysis models were trained for 20 iterations, where one
iteration corresponds to re-initialization of the model parameters
and optimizer and randomization of the training/validation split.
APPENDIX B - PROCESS MONITOR FILTER CONFIGURATION

In Table IV a summary of the allow/disallow list for the Procmon configuration is presented. The filters remove processes belonging to the Procmon application itself, as well as some auxiliary files and formats.

TABLE IV
PROCESS MONITOR CONFIGURATION FOR FILTERING OUT RELEVANT EVENTS. WITH THE EXCEPTION OF THE FINAL ENTRY, ALL OTHER ENTRIES ARE DISALLOW FILTERS.

Entity         Relation       Value
Process Name   is             Procmon.exe
Process Name   is             Procexp.exe
Process Name   is             Autoruns.exe
Process Name   is             Procmon64.exe
Process Name   is             Procexp64.exe
Operation      begins with    IRP_MJ_
Operation      begins with    FASTIO_
Result         begins with    FAST IO
Path           ends with      pagefile.sys
Path           ends with      $Mft
Path           ends with      $MftMirr
Path           ends with      $LogFile
Path           ends with      $Volume
Path           ends with      $AttrDef
Path           ends with      $Root
Path           ends with      $Bitmap
Path           ends with      $Boot
Path           ends with      $BadClus
Path           ends with      $Secure
Path           ends with      $UpCase
Path           ends with      $Extend
Event Class    is             Profiling*

* Alternates between Process, Network, File System and Registry, depending on which events are to be collected.
