Unleashing Malware Analysis and Understanding With Generative AI
Dissecting low-level malware behaviors into human-readable reports, such as cyber threat intelligence,
is time-consuming and requires expertise in systems and cybersecurity. This work combines dynamic
analysis and artificial intelligence-generative transformation for malware report generation, providing
detailed technical insights and articulating malware intentions.
12 May/June 2024 Copublished by the IEEE Computer and Reliability Societies 1540-7993/24©2024IEEE
Recent research on large language models (LLMs) in the cybersecurity field has demonstrated promising applications, such as malware static analysis [2] and malware generation [3], [4]. For example, Pearce et al. [2] explored prompting LLMs to identify purposes, capabilities, and variable names/values from code. Trained on vast amounts of unstructured text, including websites, books, and open source code, these LLMs can understand and generate natural language in response to input prompts. Inspired by this work, we aim to investigate approaches for understanding malware behaviors and generating descriptive reports.

However, applying standard LLMs, such as ChatGPT, directly to understanding malware activity and generating threat intelligence reports poses challenges. Standard LLMs are designed for processing text and code rather than sequences of syscalls, potentially limiting their ability to generate the requested results. Further details are discussed in the “Quantitative Analysis” section. To date, there is no pretrained LLM tailored specifically for dynamic analysis and cyber threat intelligence (CTI) report generation. Developing domain-specific LLMs requires significant investments in computing power, effort, and specialized datasets.

In this study, our objective is to assess the effectiveness of off-the-shelf LLMs in comprehending malware behaviors and producing CTI reports that delineate the malware lifecycle in natural language. Our investigation focuses on capturing the core essence of the behavior and operations of malware. To accomplish this goal, we conduct dynamic analysis to capture the syscall execution trace of the malware. This trace is subsequently translated into a series of structured descriptions. Leveraging a generative natural language processing model, such as ChatGPT, we enable the automatic generation of informative and easily understandable high-level natural language articles. Our research findings shed light on the potential of LLMs to improve malware understanding and CTI report generation.

Malware Activity Trace

While dynamic analysis is crucial to comprehending malware activity, two challenges arise when directly processing the execution trace of malware with an LLM. The first challenge stems from the substantial number of syscalls, which makes the trace an unfavorable input to LLMs. Specifically, truncating a prompt due to a restricted maximum input token length may result in inaccurate and incoherent responses. The second challenge arises from the inherent nature of LLMs, which are primarily designed for processing natural language, although they are exposed to a variety of programming languages during training. This suggests that the functionality of LLMs may not seamlessly align with the requirements of dynamic analysis.

Our primary objective in this work is to understand malware activity using off-the-shelf LLMs. To achieve this, we introduce an attack scenario graph (ASG) construction to reduce the size of the malware execution trace while retaining essential information. The process involves parsing the trace and mapping the resulting data onto a graph, enabling a more concise representation of the behavior of the malware. Additionally, we employ a natural language description (NLD) transformation to convert low-level syscalls into high-level descriptions suitable for LLMs. Subsequently, a prompt is formulated with the transformed NLDs to generate a coherent natural language article about the given malware. The architecture of our proposed system is illustrated in Figure 1.

Our research focus is driven by the rising prevalence of Unix-like operating systems, notably Linux, and
[Figure 1: system architecture. Stages recoverable from the diagram labels: malware sample(s) → dynamic analysis → syscall trace → generate ASG I → redundancy reduction → ASG II → extract list of malware syscall steps → NL transformer (using the syscall NL_Synonym base, arranged from the man7 website) → list of malware NL (verb, object) descriptions → input to ChatGPT → a narrative essay of malware behavior.]
www.computer.org/security 13
SYNTHETIC REALITIES AND ARTIFICIAL INTELLIGENCE-GENERATED CONTENTS
Linux malware, especially considering the expanding Internet of Things (IoT) landscape. Cozzi et al. [5] highlight the widespread use of the Linux operating system on various devices with varying CPU architectures, including x86 and ARM. Although significant attention has been devoted to understanding Windows-based malware, the understanding of and analytical infrastructure for Linux-based malware have remained relatively limited. This underscores the critical importance of our research in comprehending the operations of Linux malware and the strategies it employs to achieve its attack objectives. Therefore, our research is focused on investigating Linux malware in the expanding landscape of interconnected devices. It is essential to highlight that our proposed framework is inherently versatile, featuring graph reduction and prompting techniques for generating articles that describe malware activity. As such, its applicability extends beyond a specific platform.

Step 1: Generating a List of <Actor, Syscall Name, Direct Object> Triplets

Several components are integrated into a pipeline that provides a new, high-level way of understanding malware intentions and analyzing malware behavior. First, we take a malware sample. Using strace, a syscall trace that provides a history of the operations that the process performed is generated for the main process and for each process spawned by the malware. Strace is a diagnostic and debugging tool that intercepts and records the syscalls invoked by a process and the signals received by the process. The name of each syscall, its arguments, and its return value are saved to a specified file. In practice, a malware execution, besides the main process, may spawn zero or more child processes. After the execution, we combine all of the execution traces of a malware sample into a single trace file using the strace-log-merge command. For example, the malware Dofloo spawns 16 child processes. Figure 2 provides an illustrative example of the execution trace pertaining to the Dofloo malware.

Step 2: ASG

Our primary interest is to comprehend the intentions and behaviors of malware. We focus on the semantics of the operations carried out in each discrete step. To facilitate this understanding, we have developed a syscall parsing module called syscallParser to extract both the name of the syscall and the value of the parameter that is the direct object of the function for every syscall invocation within the trace. This extraction process results in the creation of a file referred to as the malware activity trace (MAT). Given that Linux system versions vary, our study targeted Ubuntu 18.04 LTS (kernel version of at least 4.15), which encompasses a total of 415 syscalls. Of these, 10 are either unimplemented or not listed in the manual page, leaving us with a total of 405 syscalls in our implementation [6].

To identify the specific argument serving as the direct object of each syscall, we conduct a manual review of each one. For instance, in the case of the write() syscall, the direct object is the first parameter in the output from strace [7], representing a file descriptor. We have devised a parser that scans the entire execution trace file, constructing a symbol table that correlates file descriptors with their respective file names or paths, as well as socket descriptors with their associated network address data. Consider the mmap2() function, which is used to map or unmap files or devices into memory. In this instance, the direct object is the fifth argument of the call. Following the identification of the direct object argument for each syscall, we have developed a set of regular expressions to automatically extract the syscall’s name and the value associated with its direct object for each individual syscall invocation.

To enable human analysts to inspect and visualize the operations of the malware, we have developed a visualizer module designed to transform the MAT into a provenance graph G = (V, E), referred to as the ASG. In the graph, the source node of a directed edge signifies the initiator or actor of the syscall, the edge itself corresponds to the syscall and is associated with the respective step number in the trace, and the destination node denotes the direct object of the operation. Figure 3 shows an example of an ASG of the malware Dofloo directly derived from the trace. The ASG incorporates nine distinct node types: file, process, network, memory, ID, permission, exit status, timestamp, and resource. Data provenance analysis techniques have found widespread application in the analysis of system logs, allowing the parsing of a log into provenance graphs that encapsulate the entirety of the system execution. These graphs serve to facilitate causal analysis, revealing the entities and information flows involved in an attack campaign.
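The extraction and graph-building steps described in Step 1 and Step 2 can be illustrated with a minimal Python sketch. It is not the authors' syscallParser or visualizer: the simplified line format `<pid> <syscall>(<args>) = <ret>`, the names `DIRECT_OBJECT_INDEX`, `parse_trace`, and `build_asg`, and the three-entry direct-object table are all assumptions made for illustration. Real strace output carries more fields (timestamps, structs, unfinished calls) that this sketch ignores.

```python
import re
from collections import namedtuple

# Assumed, simplified trace line format: "<pid> <syscall>(<args>) = <ret>".
LINE_RE = re.compile(
    r'^(?P<pid>\d+)\s+(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>-?\w+)'
)
# Splits on top-level commas while keeping quoted strings intact.
# Limitation: does not handle nested structs or arrays in the arguments.
ARG_RE = re.compile(r'\s*"(?:[^"\\]|\\.)*"|[^,]+')

# Hypothetical excerpt of a manually reviewed table:
# syscall name -> zero-based position of the direct-object argument.
DIRECT_OBJECT_INDEX = {"open": 0, "read": 0, "write": 0}

Step = namedtuple("Step", "number actor syscall obj")


def parse_trace(lines):
    """Turn trace lines into <actor, syscall, direct object> steps."""
    fd_table = {}  # symbol table: (pid, fd) -> file name or path
    steps = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m["name"] not in DIRECT_OBJECT_INDEX:
            continue
        pid, name, ret = m["pid"], m["name"], m["ret"]
        args = [a.strip() for a in ARG_RE.findall(m["args"])]
        idx = DIRECT_OBJECT_INDEX[name]
        if idx >= len(args):
            continue
        obj = args[idx].strip('"')
        if name == "open":
            if ret.isdigit():  # a successful open returns the new descriptor
                fd_table[(pid, ret)] = obj
        else:
            # Resolve a descriptor to its path when the symbol table knows it.
            obj = fd_table.get((pid, obj), f"fd_{obj}")
        steps.append(Step(len(steps) + 1, f"Process {pid}", name, obj))
    return steps


def build_asg(steps):
    """Build G = (V, E): actor -> direct object, edge labeled by syscall and step."""
    nodes, edges = set(), []
    for s in steps:
        nodes.update((s.actor, s.obj))
        edges.append((s.actor, s.obj, {"syscall": s.syscall, "step": s.number}))
    return nodes, edges


trace = [
    '1527 open("/proc/net/dev", O_RDONLY) = 3',
    '1527 read(3, "Inter-|   Receive", 1024) = 17',
    '1527 write(1, "ok\\n", 3) = 3',
]
steps = parse_trace(trace)
nodes, edges = build_asg(steps)
for edge in edges:
    print(edge)
```

In this sketch the read() on descriptor 3 is resolved to /proc/net/dev through the symbol table, while the write() on descriptor 1 stays as a placeholder because no open() for it appears in the sample.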
Figure 2. An illustrative example of the execution trace pertaining to the Dofloo malware.

As an example, in Milajerdi et al. [8], a high-level provenance graph is generated to summarize the actions of an attacker. This approach utilizes data sourced from syscalls captured in the Linux system audit log and represents attack activities using five system entities: process, files, network connections, memory objects, and users within the graph. In our case, we draw from
Figure 3. Part of Dofloo’s ASG. We zoomed in on four ASGs (in orange rectangles) from this attack for presentation purposes. The four reduced ASGs show (a) memory operations reduction, (b) time operations reduction, (c) IP address enumeration reduction, and (d) file operations reduction, respectively. NIC: network interface controller.
syscall execution traces as well but introduce five additional types. (The ID type here corresponds to the user in Milajerdi et al. [8].) Hassan et al. [9] develop a tactical provenance graph to analyze the causal dependencies between threat alerts generated by endpoint detection and response systems. In our methodology, we initiate the process by converting the MAT into an ASG, which serves as a preliminary step prior to conducting a more comprehensive analysis.

Given that a typical malware execution trace is often extensive, comprising thousands of lines of call invocations, the resulting ASG can become challenging to visualize and inspect due to its size. Through our experiments, we have identified four areas where we can implement redundancy reduction techniques, effectively reducing the graph’s size without a loss of information:

■■ Memory operations reduction: Malware often executes multiple memory operations when spawning a new process. Those operations include syscalls like brk() for adjusting the data segment’s end as well as set_thread_area(), set_tid_address(), set_robust_list(), futex(), mprotect(), arch_prctl(), and munmap(), all of which involve memory address manipulation. However, the specific values of these memory addresses do not inherently reveal much about the malicious intent of the malware. To address this, we consolidate these memory operations under a single sink (destination) node, collectively denoting them as memory addresses. This approach can significantly reduce the graph size; an example is shown in Figure 3(a).
■■ Time operations reduction: In numerous malware execution traces, the time() function is frequently invoked to obtain the current time in seconds. This results in a considerable number of such call invocations in the trace file. Similarly, we employ a sink node to collectively represent these time() operations, effectively reducing the graph size; an example is shown in Figure 3(b).
■■ IP address enumeration reduction: Malware samples often attempt to get information such as the IP addresses assigned to Linux interfaces (e.g., eth0, eth1, eth2, etc.) by exhaustively searching with the read() function. This leads to a proliferation of call invocations in the trace. We use a sink node to collectively signify these IP address enumeration operations; an example is shown in Figure 3(c).
■■ File operations reduction: Malware frequently searches for target files within a host system, resulting in multiple open(), read(), and write() calls to files. In some scenarios, the number of these calls can extend into the hundreds. To address this redundancy, we combine these duplicated identical operations into one single edge within the ASG; an example is shown in Figure 3(d).

In the context of these sink destination nodes, a dedicated data structure is established to store the data associated with the objects of the original syscall functions to ensure that no information is lost during the process of reducing redundancy in the graph nodes.

Table 1 shows the size reduction results of five malware samples. Dofloo (also known as AESDDoS) is a malware that is used to create large-scale botnets to launch distributed denial-of-service (DDoS) attacks and to load cryptocurrency miners onto the infected machines. Gafgyt is a backdoor malware that affects IoT devices to launch DDoS attacks. Darlloz is a malware that targets the IoT and infects routers, security cameras, and set-top boxes by exploiting a Hypertext Preprocessor (PHP) vulnerability. LuaBot is a trojan that is completely coded in the Lua language, targeting Linux platforms to recruit them in a DDoS botnet. Tsunami, also known as Kaiten, is a type of DDoS bot that uses Internet relay chat to communicate with the threat actor.

As expected, botnets for DDoS attacks have dominated the Linux-based malware landscape in the last few years. These five malware samples belong to the most popular malware families that harvest poorly protected IoT devices. The table presents the sizes of both the original ASG (referred to as ASG I) and the resulting ASG (referred to as ASG II) after the redundancy reduction

Table 1. The size reduction results of five malware samples.

Malware   Number of   ASG   Number of   Number of      Number of           Number of
          Syscalls          Steps       Source Nodes   Destination Nodes   Edges
Dofloo    11,696      I     7,577       11             392                 7,577
                      II    7,577       11             54                  284
Gafgyt    864         I     404         13             36                  404
                      II    404         13             35                  79
Darlloz   5,019       I     2,075       27             524                 2,075
                      II    2,075       27             315                 596
LuaBot    13,801      I     9,746       3              700                 9,746
                      II    9,746       2              288                 349
Tsunami   764         I     344         2              42                  344
                      II    344         2              16                  38
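The four reductions and the side store for sink nodes can be sketched in a few lines of Python. This is an illustrative reading of the description, not the authors' implementation: the sink labels, the `eth\d+` pattern used to detect interface enumeration, and the choice to merge edges that share the same (actor, syscall, destination) are assumptions, as are the names `sink_for` and `reduce_asg`.

```python
import re
from collections import defaultdict

# Memory-related syscalls named in the text; their address values are
# redirected to one sink node.
MEMORY_SYSCALLS = {
    "brk", "set_thread_area", "set_tid_address", "set_robust_list",
    "futex", "mprotect", "arch_prctl", "munmap",
}


def sink_for(syscall, obj):
    """Return the sink label for an edge, or None if the edge keeps its object."""
    if syscall in MEMORY_SYSCALLS:
        return "memory addresses"
    if syscall == "time":
        return "timestamps"
    # Assumed heuristic: reads on interface names count as IP enumeration.
    if syscall == "read" and re.fullmatch(r"eth\d+", obj):
        return "network interfaces"
    return None


def reduce_asg(edges):
    """edges: iterable of (actor, syscall, obj, step).

    Returns (reduced, sink_store). `reduced` maps one merged edge
    (actor, syscall, destination) to the list of original step numbers;
    `sink_store` keeps the original objects so no information is lost.
    """
    reduced = {}
    sink_store = defaultdict(list)
    for actor, syscall, obj, step in edges:
        sink = sink_for(syscall, obj)
        dest = sink if sink is not None else obj
        if sink is not None:
            sink_store[sink].append((step, syscall, obj))
        # Identical operations collapse onto a single edge.
        reduced.setdefault((actor, syscall, dest), []).append(step)
    return reduced, dict(sink_store)


edges = [
    ("malware", "brk", "0x8fe7000", 2),
    ("malware", "brk", "0x9008000", 3),
    ("malware", "mprotect", "0x7f2d2488f000", 4),
    ("malware", "time", "2022-09-20 05:05:39", 5),
    ("malware", "time", "2022-09-20 05:05:45", 6),
    ("malware", "read", "/proc/net/dev", 7),
    ("malware", "read", "/proc/net/dev", 8),
    ("malware", "read", "eth0", 9),
    ("malware", "read", "eth1", 10),
]
reduced, sink_store = reduce_asg(edges)
for edge, step_numbers in reduced.items():
    print(edge, step_numbers)
```

Keeping the step numbers on each merged edge and the raw objects in `sink_store` is one way to satisfy the requirement that the reduction loses no information: the original trace order and values can still be looked up from the reduced graph.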
base denoted as linuxSyscall_SynonymBase and is responsible for transforming each line of data within C into a basic NLD. The synonym base is a structured table-like file that comprises a collection of tuples, each consisting of three elements: the syscall name, a verb or verb phrase, and a direct object noun or noun phrase. To establish the Linux syscall NL synonym base, we refer to both the name and description sections of a syscall in the Linux man pages [10]. We extract the verb or verb phrase along with its corresponding direct object noun or noun phrase from the descriptions. For example, for clone(), we extract the verb “create” and the noun phrase “a child process.” Table 3 lists some contents of the linuxSyscall_SynonymBase.

Table 3. Example contents of the Linux syscall NLD base.

Syscall Name   Verb Phrase   Direct Object Noun Phrase
clone()        Create        A child process
exit()         Terminate     The calling process
linkat()       Create        A file link
rename()       Change        Name or location
brk()          Change        The location of the program break
open()         Open          A file or device
getpid()       Get           Process identification
exec()         Execute       A program
wait4()        Wait for      A process

By referencing the linuxSyscall_SynonymBase, the linuxSyscall_NLTransformer module takes a list of the quadruplets of the ASG as the input and outputs a list of the basic NLDs of the attack steps of the malware, as shown in Table 2. In line 1, i.e., step 1, the original <Step 1, malware, exec(), uname> is translated into “Step 1: ‘malware’ execute program: uname.” In line 2, the original <Step 2, malware, brk(), memoryAddr_0x8fe7000> is translated into “‘malware’ change the location of the program break: 0x8fe7000.” In lines 3 and 4, the malware sets up an entry in the current thread’s thread-local storage array and subsequently assigns the pointer to a thread ID. Combining these steps, one can understand that the malware is performing memory location manipulation. We consider this a significant advance in readability that provides a higher level semantic presentation of the syscall trace from the reader’s perspective. For instance, a forensic analyst can perform dynamic analysis of a suspicious, possibly malicious program and obtain its syscall trace, which can then be fed to the module to obtain a more easy-to-understand translation of the low-level trace.

This approach improves the legibility of syscall traces and offers readers a more semantically enriched perspective. In the next phase of our work, we explore the potential to uncover the inherent intentions behind sequences of consecutive operations. We aim not only to offer succinct NLDs of syscall traces but also to enrich these descriptions with relevant details about the operations involved, giving readers deeper insight into the behavior of malicious software. The algorithm of the linuxSyscall_NLTransformer is presented in Algorithm 1.

Malware Activity Report

Step 4: Prompting LLMs

Recent pretrained LLMs have performed impressively on diverse NL processing tasks, such as creative writing and crosswords [11]. We conducted a trial with LLM technology in which the input is a list of basic NLDs of an ASG of a malware sample, and the expected output is a coherent and informative passage about the given input. We made two observations based on the generated output. One is that LLMs have the capability to generate more natural text with lexical variety aligning with the given input. The other is that LLMs are able to provide relevant information for the given input because they have domain knowledge (without fine-tuning) and infer the context. Thus, LLMs facilitate the generation of comprehensive malware activity articles.

We develop a prompting approach to leverage the creative potential of LLMs to generate a malware activity report automatically. Here, the latest version of ChatGPT (version 3.5) [12] is used for text generation. We employ a chain-of-thought [13] style approach to prompt ChatGPT

Algorithm 1: linuxSyscall_NLTransformer

Input: stepList, synonymBase
Output: NL_stepList
1: NL_stepList = []
2: for each step in stepList do
3:     sourceNode = step["sourceNode"]
4:     syscall = step["edge"]
5:     destNode = step["destNode"]
6:     verb = synonymBase[syscall]["verb"]
7:     object = synonymBase[syscall]["object"]
8:     NL_step = sourceNode + verb + object + ":" + destNode
       NL_stepList.append(NL_step)
9: end for
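As a runnable counterpart to Algorithm 1 and the prompt formulation step, the following Python sketch transforms ASG quadruplets into basic NLDs and concatenates them under an instruction. The synonym entries are adapted from Table 3 (lowercased, with the leading article "a" dropped so that the output matches the examples quoted in the text); the `step` key, the function names, and the instruction string are assumptions. The actual prompt of Table 4 is not reproduced in this excerpt, so `instruction` below is only a placeholder, and no LLM API call is made.

```python
# Excerpt of a synonym base: syscall name -> (verb, direct object phrase).
# Entries adapted from Table 3; formatting choices are assumptions.
SYNONYM_BASE = {
    "clone": ("create", "child process"),
    "brk": ("change", "the location of the program break"),
    "open": ("open", "file or device"),
    "exec": ("execute", "program"),
}


def nl_transform(step_list, synonym_base):
    """Translate <step, sourceNode, edge, destNode> quadruplets into basic NLDs."""
    nl_step_list = []
    for step in step_list:
        name = step["edge"].rstrip("()")  # "exec()" -> "exec"
        verb, obj = synonym_base[name]
        nl_step_list.append(
            f'Step {step["step"]}: "{step["sourceNode"]}" '
            f'{verb} {obj}: {step["destNode"]}'
        )
    return nl_step_list


def build_prompt(nl_steps, instruction):
    """Place the instruction first, then one NLD per line."""
    return instruction + "\n\n" + "\n".join(nl_steps)


step_list = [
    {"step": 1, "sourceNode": "malware", "edge": "exec()", "destNode": "uname"},
    {"step": 2, "sourceNode": "malware", "edge": "brk()", "destNode": "0x8fe7000"},
]
nl_steps = nl_transform(step_list, SYNONYM_BASE)

# Placeholder instruction, not the prompt used in the study.
instruction = "Describe the following malware activity steps as a coherent report."
print(build_prompt(nl_steps, instruction))
```

A syscall missing from the synonym base raises a KeyError here; a fuller implementation would need a fallback description for such steps.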
Table 5. The statistical descriptions on traces, trace-based generated reports, and ASG–NLD-based generated reports.
■■ RQ4: How do LLMs contribute to the analysis and generation of reports?

Quantitative Analysis

For RQ1, we conduct a comparison between our generated responses and those generated directly from the original execution traces of the provided samples. To ensure a fair comparison, we use the same prompt as presented in Table 4, with the only difference being the replacement of the input with the original traces. In Table 5, we provide a statistical description of the four malware samples, comparing the original execution traces with both trace-based generated reports and our generated reports. Notably, our generated outputs demonstrate significant brevity compared to the others. This shows the advantages of our proposed ASG and the NLD transformation. We emphasize the collective representation of behavior through multiple syscalls.

In this study, LLMs demonstrate the ability to merge and condense pertinent NLDs linked to multiple syscalls, encapsulating the intent of the operation into concise statements. This discovery underscores that LLMs play a crucial role in enhancing our comprehension of malware behaviors. For example, LLMs succinctly summarized 174 NLDs of Darlloz into one sentence, stating, “Similarly, the process repeats these actions for processes 2 through 1882, sequentially opening their corresponding stat files in the ‘/proc’ directory and reading the file descriptors (Steps 24 to Steps 197).” Moreover, LLMs go beyond mere transformation, providing concise insights into the intent of operations. For instance, in the case of Tsunami, LLMs inferred the achievement of persistence as the sample attempted to access files associated with system initialization and configuration on Unix, as articulated in the statement, “Persistence Mechanisms: Intriguingly, the ‘malware’ opens the files /etc/rc.d/rc.local (Step 8) and /etc/rc.conf (Step 9). . . . By accessing these files, the malware is likely attempting to achieve persistence. . . .”

Note that when taking traces as inputs, we made three observations: 1) LLMs might deviate from instructions to produce a report; 2) LLMs might generate some irrelevant technical descriptions; and 3) in some cases, LLMs might refuse to comply with requests, providing responses such as, “I’m sorry, but as an AI language model, I cannot perform the analysis for the provided trace. It appears to be a syscall trace, and I would recommend using a specialized tool for analyzing syscall traces, such as strace or a similar tool, to interpret this data.” This suggests that the current capability of LLMs might be insufficient to directly analyze malware activity and generate an intelligence report.

Table 7. An excerpt of the given prompt and its generated response to describe Gafgyt.

Prompt:
Step 1: “malware” check user permissions: /usr/bin/python
Step 2: “malware” get user identity: UID:0
...
Step 31: “Process 1526” open file or device: /dev/watchdog
Step 32: “Process 1526” open file or device: /dev/misc/watchdog
Step 33: “Process 1526” open file or device: /sbin/watchdog
...

Response:
User Permissions and Identity: The malware (“malware”) begins by checking the user permissions for the /usr/bin/python file (step 1). It then retrieves the user identity and identifies it as UID:0, indicating root privileges (step 2). This privileged user status grants the malware significant control over the system.
...
File and Device Operations: Process 1526, a child process created by the “malware,” performs a series of file and device operations. It starts by opening the file or device /dev/watchdog (step 31) and /dev/misc/watchdog (step 32), which are related to watchdog functionality. It continues by opening /sbin/watchdog (step 33) and /bin/watchdog (step 34), indicating the execution of specific watchdog programs. . .
Watchdog Functionality and System Monitoring: The file and device operations performed by process 1526 indicate its engagement with watchdog functionality. Watchdog timers are used in systems to monitor the health and availability of critical processes. By opening watchdog-related files and devices, process 1526 suggests an intention to control or manipulate the watchdog mechanism. . .

For RQ2 and RQ3, we analyzed our generated outputs to assess our system’s capabilities in coverage and complementarity, respectively. The coverage ratio is measured as the set of words in the NLDs covered within the generated reports, while the complementarity ratio is determined as the set of words in the generated reports not present in the NLDs. To compute coverage and complementarity, two texts are treated as bags of words, with the words transformed into stems using the Porter stemming algorithm.

Table 6 indicates that most of the NLDs are included in the generated reports of Gafgyt (89.54%) and Tsunami (94.74%), where reports convey almost the same message in more natural language forms. On the other

Table 6. The coverage and complementarity ratios of four malware samples.

Malware   Coverage (%)   Complementarity (%)
Gafgyt    89.54          58.98
Darlloz   64.04          34.69
LuaBot    44.4           42.46
Tsunami   94.74          64.82
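One way to read the coverage and complementarity definitions is sketched below in Python. The denominators are an assumption, since the text does not spell them out: coverage is taken as the share of NLD stems that also occur in the report, and complementarity as the share of report stems absent from the NLDs. The `crude_stem` function is only a stand-in so that the example runs without third-party packages; it is not the Porter algorithm. A real Porter stemmer (for instance, the `stem` method of NLTK's `PorterStemmer`) can be passed through the `stem` parameter instead.

```python
import re


def crude_stem(word):
    """Stand-in suffix stripper for illustration only; not the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_set(text, stem=crude_stem):
    """Lowercase, tokenize on alphanumerics, and reduce the text to a set of stems."""
    return {stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())}


def coverage_and_complementarity(nld_text, report_text, stem=crude_stem):
    """Return (coverage %, complementarity %) under the assumed denominators."""
    nld = stem_set(nld_text, stem)
    report = stem_set(report_text, stem)
    if not nld or not report:
        raise ValueError("both texts must contain at least one word")
    coverage = 100 * len(nld & report) / len(nld)
    complementarity = 100 * len(report - nld) / len(report)
    return coverage, complementarity


nld_text = '"malware" open file or device: /dev/watchdog'
report_text = (
    "The malware opens the watchdog device, "
    "suggesting control of the watchdog"
)
cov, comp = coverage_and_complementarity(nld_text, report_text)
print(f"coverage={cov:.2f}%  complementarity={comp:.2f}%")
```

Under this reading, a high coverage with a moderate complementarity, as reported for Gafgyt and Tsunami in Table 6, means the report restates most NLD content while adding vocabulary of its own.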