Master Thesis Fikret
Labeling
Generating labeled APT host datasets
using CALDERA and Sysmon
Fikret Kadiric
Department of Informatics
Faculty of Mathematics and Natural Sciences
UNIVERSITY OF OSLO
Spring 2022
APT Attack Emulation and Data
Labeling
Fikret Kadiric
© 2022 Fikret Kadiric
http://www.duo.uio.no/
Cyber criminals and others seeking to use the cyber domain for malicious
purposes are actively pursuing and attempting to exploit new vulnerabilities.
Cyber attacks are constantly changing, and new attacks are developed
continuously. While cyber criminals try to attack and breach systems,
defenders attempt to fend off the attackers. The introduction of Intrusion
Detection Systems (IDS) has been a great contribution to the defense of the
cyber domain. These systems can be trained on datasets containing attack
data in order to automatically detect such attacks in the future.
However, due to challenges regarding the labeling and production of attack
data, publicly available labeled host datasets are rare. The situation
becomes even worse when it comes to APT datasets. APT attacks tend to
utilize new vulnerabilities and tools, and they are constantly adapting
and evolving over time. As such, APT datasets may quickly become
outdated, leaving defenders to rely on obsolete data. In order to keep
pace with these evolving attacks and detect them, we need datasets
containing them.
This thesis examines a new approach to generating labeled datasets
through an automated labeling tool developed by the author. Attack data
is generated by an adversary emulation tool (CALDERA) while the resulting
system changes are recorded through System Monitor (Sysmon). It is found
that the labeling tool is capable of generating fine-grained labeled
datasets, applying labels at the attack-technique level directly tied to
MITRE ATT&CK. This new approach enables the creation of new datasets in a
convenient and efficient manner, allowing researchers to create specific
datasets if desired. This thesis is a contribution to research within the
field of cybersecurity.
Acknowledgements
Contents
1 Introduction 1
1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Research Methods . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background 7
2.1 Existing datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Traffic generation using virtualization . . . . . . . . . . . . . 9
2.3 Data Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Windows Event Logs . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Sysmon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 MITRE ATT&CK . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Adversary Emulation . . . . . . . . . . . . . . . . . . . . . . . 15
2.7.1 Automated Adversary Emulation . . . . . . . . . . . 16
2.7.2 CALDERA . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Applying labels to logs . . . . . . . . . . . . . . . . . . 43
5 Results 54
5.1 Emulation of APT29 . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Final Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3 Labeling Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Appendices 71
.1 Labeled logs related to T1134.002 . . . . . . . . . . . . . . . . 72
List of Figures
5.2 Successful, Failed, and Skipped tactics . . . . . . . . . . . . . 56
5.3 Distribution of benign and malicious labels . . . . . . . . . . 57
5.4 Distribution of benign, uncertain, and malicious labels . . . 58
5.5 Distribution of tactics across malicious events . . . . . . . . . 59
5.6 Technique label distribution in dataset . . . . . . . . . . . . . 60
5.7 Distribution of tactics and techniques from trail emulations . 61
Chapter 1
Introduction
1.1 Context
Cybercriminals are actively pursuing and attempting to exploit
vulnerabilities residing within, e.g., computer networks or software. These
vulnerabilities, also called weaknesses, can occur as a result of
misconfigurations, flaws, or user errors, to name a few. If successfully
exploited, a vulnerability may give an attacker entry to the network,
allowing them to conduct further malicious operations. Discovering and
mitigating vulnerabilities may stop most novice hackers, but it is seldom
enough to keep a determined and skilled hacker at bay. A mitigated
vulnerability does not necessarily mean that it cannot be exploited.
Advanced hackers may gain access by exploiting unknown vulnerabilities or
flaws that are yet to be discovered, or by chaining together multiple
low-priority vulnerabilities. No matter how secure a system is, it is
difficult to guarantee perfect security.
Dealing with advanced persistent threat (APT) actors requires a deeper
level of defense and detection. What separates an APT from most other
hackers is their ability to access confidential information using advanced
techniques, with a slower attack process in order to reach their goals [39].
The attack process of an APT often consists of five main phases:
reconnaissance, foothold establishment, lateral movement/discovery, data
exfiltration, and post-exfiltration [1]. Once inside, APTs can blend in
with the background “noise” by utilizing legitimate processes, often
combining the use of multiple processes to perform malicious activities.
APTs are usually highly capable and well-funded threat actors, with
advanced tools and close to unlimited resources and time. Ideally, we want
to detect these attacks as early as possible, but as APT attacks are
usually more sophisticated and complicated, they often manage to bypass
the security mechanisms in place. The goal of an advanced attacker is not
merely to compromise the network and achieve unauthorized access, but
usually includes more specific goals, such as stealing data and
information or disrupting critical services. These goals require
comprehensive post-compromise work, where the attacker aims to stay in the
network undetected for as long as possible.
Intrusion detection systems (IDS) are tools that automatically monitor and
analyze the behavior of a computer or a network environment and report any
suspicious activities detected. An IDS grants an additional layer of
system security by recognizing patterns that are known to relate to
malicious traffic or are directly tied to adversaries. These systems are
generally classified in two distinct ways: either based on their
operational context, e.g., network- or host-based detection, or on how
they aim to fulfill their task, e.g., supervised or unsupervised
detection. Supervised detection utilizes a training dataset in order to
train the IDS to differentiate between malicious and benign traffic, while
unsupervised detection is trained on the normally functioning system as a
baseline for comparison between expected and unexpected traffic [9].
Research within the field of intrusion detection systems has flourished
over the years. However, many researchers struggle to find suitable
datasets to evaluate and test the efficiency of their proposed models [27].
Issues regarding privacy concerns, data labeling, differing research
objectives, and the availability of datasets are some of the issues
described by Nehinbe [35]. Machine learning, which is widely used in
anomaly-based IDS, learns from the provided datasets. The datasets may
either be labeled, e.g., with malicious and non-malicious data, or
unlabeled. From these observations the algorithms used are able to
generalize new observations, thus allowing future observations to be
automatically classified. The importance of correctly labeled datasets for
IDS efficiency evaluations has been highlighted by several researchers,
e.g., Catania et al. [7] and Davis et al. [14].
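The supervised idea can be sketched with a deliberately simple toy: a detector "trained" on labeled examples learns which tokens indicate malicious activity, then classifies unseen events. All event strings, tokens, and labels below are invented for illustration and are far simpler than any real IDS.

```python
# Toy supervised detector: learn tokens that only appear in malicious
# training events, then flag new events containing any such token.

def train(labeled_events):
    """Collect tokens that appear only in malicious training events."""
    benign_tokens, malicious_tokens = set(), set()
    for event, label in labeled_events:
        tokens = set(event.lower().split())
        (malicious_tokens if label == "malicious" else benign_tokens).update(tokens)
    return malicious_tokens - benign_tokens

def classify(event, indicators):
    """Label a new event malicious if it contains any learned indicator."""
    return "malicious" if set(event.lower().split()) & indicators else "benign"

training = [
    ("svchost.exe started by services.exe", "benign"),
    ("mimikatz.exe dumped lsass memory", "malicious"),
    ("explorer.exe opened notepad.exe", "benign"),
]
indicators = train(training)
print(classify("powershell launched mimikatz.exe", indicators))  # malicious
```

The quality of such a detector depends entirely on the labels being correct, which is precisely why correctly labeled datasets matter.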
Generating such APT datasets presents multiple challenges [40]:
• Storage and processing problems related to large amounts of data.
Because APT attacks occur over a long period of time, we need to collect
enough data to properly represent this, ideally covering at least several
months.
• There is no set path that all APT attacks follow; APT attacks may vary
in terms of, e.g., tools, techniques, tactics, and procedures.
• APT attacks are constantly adapting and evolving over time, and they
tend to utilize new vulnerabilities and tools.
• Labeling of APT datasets requires expert knowledge.
• Simulating an APT attack in a realistic manner is difficult.
Aside from the challenges directly tied to APT attack data, there are also
several challenges particularly related to the process of labeling the
datasets [28]:
• The data is generally generated in large volumes, usually making
manual labeling of all lines infeasible.
• A single action may manifest itself in multiple log sources.
• Log lines corresponding to malicious actions may be interrupted by
normal log messages, as processes frequently interleave.
• Execution of a malicious command may manifest in the logs at a much
later time due to delays or dependencies on other events.
Using a training dataset with correctly labeled logs is a crucial aspect
of IDS efficiency, ideally with labels in greater detail than, e.g.,
"benign" and "malicious".
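The difference between coarse and fine-grained labels can be sketched as follows. The events and ATT&CK technique IDs below are illustrative examples, not entries from the dataset produced in this thesis; the key point is that a fine label can always be reduced to a binary one, but never the reverse.

```python
# Illustration: fine-grained labels (ATT&CK technique IDs) can be
# coarsened to binary labels, but detail lost at labeling time is gone.

def coarsen(label):
    """Reduce a technique-level label to a binary benign/malicious label."""
    return "benign" if label == "benign" else "malicious"

events = [
    ("Sysmon EID 1: whoami.exe", "T1033"),              # System Owner/User Discovery
    ("Sysmon EID 1: procdump.exe -ma lsass", "T1003"),  # OS Credential Dumping
    ("Sysmon EID 1: chrome.exe", "benign"),
]
print([(event, coarsen(label)) for event, label in events])
```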
1.4 Research Methods
This thesis is a study in cyber security, a subfield within computer
science. Fundamentally, the thesis follows a scientific research method
through an experimental research design [16]. It is necessary to design an
experiment that generates data in order to evaluate the truth of the
hypothesis. The system around the experiment is controlled to ensure that
only specific inputs are altered in order to determine their effect. A
good experiment exhibits the following characteristics [16]:
1.5 Thesis outline
This section provides an overview of the thesis chapters and their content:
Chapter 4: Data Collection and Labeling This chapter describes the process
of collecting and processing the logs. It gives an insight into how the
developed labeling tool applies fine-grained labels to the logs, and the
various iterations of the tool leading up to the final version.
Chapter 5: Results This chapter presents the results of the emulation and
the final dataset created with the labels applied.
Chapter 6: Discussion and Related Work This chapter discusses and compares
the final result of the thesis with related research. Limitations related
to the presented dataset, labeling approach, and developed tool are also
discussed.
Chapter 7: Conclusion and Future Work This chapter concludes the thesis
based on the discussion, answering the research questions of the thesis.
Suggestions for future work are also presented in this chapter.
1.6 Contribution
With the developed labeling tool and approach for applying fine-grained
labels, the research has addressed the following hypothesis:
Chapter 2
Background
These datasets contain what are considered regular malware attack vectors,
where the goal is to successfully infect a network or a client, while the
detection of APTs involves identifying long-term attack behavior. Myneni
et al. [33] have attempted to generate a dataset suitable for APT
behavior, one which considers the later post-compromise phases such as
data exfiltration and lateral movement. Attack data was generated by
having a red team simulate the attack behavior of an APT. Vulnerabilities
were discovered using various scanning tools and later exploited through
scripts and known attack tools, e.g., Metasploit [38], in order to
establish a foothold within the network. From this foothold, the red team
moved laterally within the network by employing known lateral movement
techniques before collecting and exfiltrating data.
The comparison of multiple datasets done by Myneni et al. [33] revealed a
massive lack of data related to the later stages of an advanced attack. As
seen in Figure 2.1, most of the datasets only covered the early attack
phases, such as reconnaissance and foothold establishment. Only the
DAPT2020 dataset covers data exfiltration, which is often the goal of most
advanced attacks.
Figure 2.1: Attack phases covered by known datasets (Myneni et al. [33])
Datasets that may be considered representative of realistic data can quickly
become outdated. To constantly generate new datasets, a framework for
generating both benign and attack data is desirable. Despite these potential
drawbacks of virtualization, the DAPT2020 [33] dataset contains unique
and valuable attack data.
The lack of variation results in a clearer separation of benign and
malicious activities, which in turn leads to overoptimistic results from
any detection systems trained on the data [10]. To address this issue, and
several other issues related to data generation through virtualization,
Clausen et al. [10] proposed a new framework for traffic generation by
clustering a series of containers interacting with one another, where each
container was responsible for either hosting a specific service or
performing a specific task. Executing a scenario, e.g., a step of an
attack, would trigger the launch of several containers, dynamically
launching new containers depending on the results of an operation. In
order to introduce some variation, a series of sub-scenarios were created.
These sub-scenarios consisted of simple variations of a scenario, such as
mixing up the use cases for different protocols and determining whether
specific operations succeed or fail.
Most notable was the use of this framework to achieve ground truth, being
able to separate and label the data from different origins with stronger
certainty. The noise of background processing was minimal, since each
container was only running a specific piece of software or application
related to a scenario. Data from each container was collected separately
so that the distinction between the traffic's origins was clear; this
could allow for higher label granularity. Label granularity refers to how
specific a label is. Higher label granularity could be achieved by
labeling the data by the distinct type of attack it represents, e.g., host
discovery, privilege escalation, or credential dumping, while benign data
could be labeled by, e.g., the service, software, or protocol related to
the traffic. The impact label granularity has on machine learning
classification has been studied within the field of image recognition.
Studies within this field have shown that finer label granularity can lead
to higher accuracy and higher training data efficiency, which may reduce
the amount of training data needed [8]. To what extent label granularity
may affect the training of intrusion detection systems has not been
studied, but it stands to reason that more granular labels are desirable
to distinguish between the different types of activities.
In order to trace this attempt further, we would need network data to
determine whether the adversary attempted to move laterally to another
endpoint. If both these behaviors are observed in correlation with one
another, it would indicate a potential attack with stronger probability.
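The correlation described above can be sketched as a simple join over host and network events: flag only when both behaviors occur on the same host within a short time window. The hostnames, event descriptions, timestamps, and the 60-second window below are all invented for illustration.

```python
# Toy correlation of host events with network events: report pairs that
# occur on the same host within a given time window.
from datetime import datetime, timedelta

def correlate(host_events, net_events, window=timedelta(seconds=60)):
    """Return (host_desc, net_desc) pairs on the same host within the window."""
    hits = []
    for h_time, h_host, h_desc in host_events:
        for n_time, n_host, n_desc in net_events:
            if h_host == n_host and abs(n_time - h_time) <= window:
                hits.append((h_desc, n_desc))
    return hits

host_events = [(datetime(2022, 3, 1, 10, 0, 5), "WIN10-1", "remote service created")]
net_events = [(datetime(2022, 3, 1, 10, 0, 30), "WIN10-1", "SMB connection to WIN10-2")]
print(correlate(host_events, net_events))
```

Either event alone is weak evidence; the pair together is what raises the probability of an actual lateral movement attempt.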
Windows uses nine audit policy categories and 50 subcategories, which
allow for more granular control over what information is logged. These
policies define the type of system objects that will be monitored (e.g.,
files, registry keys, network events), what type of access should be
recorded (e.g., read, write, permission changes), and whose accesses
should be monitored (e.g., users, system, all). Depending on what
information is to be collected, the configuration can be found in either
the System Access Control List (SACL) or the Local Security Policy. The
SACL enables the logging of attempts to access a securable object (e.g.,
files and registry objects), which can generate audit records when an
access attempt fails, succeeds, or both. The Local Security Policy
controls what sorts of events and objects generate events in the logs; it
contains an Audit Policy section and an Advanced Audit Policy section, the
latter allowing more granular audit controls. When enabling an audit
policy, it may be set to log Success events, Failure events, or both,
depending on the policy, as some policies only generate Success events.
Only enabling logging of Failure events for each category is generally not
recommended, as many of the most important events (e.g., changes to
critical user accounts/groups, account lockouts, security setting changes)
are Success events.
Logs can be collected once the logging mechanisms have been enabled and
configured. Turning on all the logging mechanisms would provide the most
coverage; the drawback of this is the volume of log events. The generated
volume would quickly overwhelm commodity storage systems and impact the
performance of the monitored systems. If the logs are sent to a
centralized log management solution, the volume of logging generated may
impact the network's performance, particularly if the information is
transmitted across low-bandwidth connections. Limiting the volume
generated is thus crucial in order to avoid unintended side effects on
normal business processes. This may be achieved by configuring the
policies to filter out the majority of irrelevant events, while still
maintaining the ability to detect malicious behavior and collect relevant
information regarding the operations conducted prior to the event.
However, deciding what configuration yields the most coverage while
limiting the volume is no easy task. Ideally, we want coverage of events
related to the occurring incident, while filtering out "background noise".
What counts as a relevant event varies depending on the ongoing operation
and the activities being conducted as a result. If the goal is to detect a
data exfiltration event, monitoring for file creations and modifications
as well as network connections is essential, but enriching this data with
account logon/creation or system events aids greatly when attempting to
trace down the root of the issue. In order to gain the most out of the
logs, the correct audit policy configurations must be applied. The Audit
Policy Recommendations from Microsoft [32] are a good starting point, but
further configuration of the policies, and perhaps other logging sources,
may be necessary to gain the desired coverage.
2.5 Sysmon
While the recommended audit policies grant some visibility into system
activities, tools such as System Monitor (Sysmon) [48] provide even
greater visibility. Sysmon is a highly configurable tool from the Windows
Sysinternals suite that monitors and logs system activities. It provides
detailed information regarding process creation/termination, driver and
library loads, network connections, file creations, registry changes,
process injections, and more. In comparison to the built-in Windows audit
logs, Sysmon is capable of monitoring a wider range of system activities,
with each corresponding event containing significantly more information
and detail. This additional detail is of great help when detecting and
mapping corresponding indicators of compromise. Figure 2.3 contains a
complete list of events that Sysmon can record [48].
Figure 2.3: Sysmon Event Table
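To illustrate what such events look like when exported, the sketch below parses a hand-written, heavily truncated Sysmon process-creation event (Event ID 1) rendered as XML. The field names follow Sysmon's event schema, but the sample values are invented.

```python
# Minimal sketch: extract the event ID and EventData fields from a
# Sysmon event rendered in the Windows Event Log XML format.
import xml.etree.ElementTree as ET

SAMPLE = """
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System><EventID>1</EventID></System>
  <EventData>
    <Data Name="Image">C:\\Windows\\System32\\cmd.exe</Data>
    <Data Name="CommandLine">cmd.exe /c whoami</Data>
    <Data Name="ParentImage">C:\\Windows\\explorer.exe</Data>
  </EventData>
</Event>
"""

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

def parse_sysmon_event(xml_text):
    """Return (event_id, {field name: value}) for one Sysmon event."""
    root = ET.fromstring(xml_text)
    event_id = root.findtext("e:System/e:EventID", namespaces=NS)
    data = {d.get("Name"): d.text for d in root.findall("e:EventData/e:Data", NS)}
    return event_id, data

event_id, data = parse_sysmon_event(SAMPLE)
print(event_id, data["Image"])
```

Fields such as CommandLine and ParentImage are what make Sysmon events so much richer than the built-in audit log entries for the same activity.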
2.6 MITRE ATT&CK
The MITRE ATT&CK framework [43] is a knowledge base and model for
adversary behavior. The framework is designed to reflect the various
attack phases of an adversary and the platforms they are known to target.
It consists of the following core components:
With more than 250 techniques within ATT&CK [43], testing for coverage of
all of these techniques individually is clearly a time-consuming and
labor-intensive task. Some of the techniques are also completely
legitimate commands that may be executed during normal operations.
Pivoting off a single indication and determining that it is a malicious
instance can be challenging. To confirm that an instance is indeed
malicious, defenders can look for other techniques that might have been
used in conjunction with the previous detection. A domain admin login by
itself does not mean that the account has been compromised, but if seen in
conjunction with group discovery or credential dumping it might raise a
concern. Atomic Red Team focuses strictly on technique execution, which
can be valuable for signature detection development. A more complex tool
is necessary for those seeking to fully emulate an adversary.
Applebaum [2] gives an insight into the complexity of decision making and
emulation techniques, as well as comparing the performance of the
different techniques. Six automation techniques were chosen to show how an
attacker can choose to execute an operation in a given scenario. Three of
the techniques only consider the immediately possible actions, while the
other three operate using a planning paradigm, where they attempt to plan
actions to execute in the future. These planning-based strategies chain
together multiple actions to achieve a future goal; e.g., in order to
achieve exfiltration, the plan could be to escalate, enumerate the host,
and collect information to exfiltrate. The best possible plan to execute
may not always be clear, as adversaries often operate with uncertainty in
the environment, e.g., about what services or endpoints exist within the
network, or whether the endpoints are vulnerable to exploits. To combat
this, the planning-based agents evaluate the situation by constructing a
simulation of the environment based on the knowledge currently available
and evaluating possible actions. Once an action is executed, the response
is observed and added to the internal knowledge base of the agent.
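The observe/plan/execute loop described above can be sketched as a toy planner: the agent picks, among the actions its current knowledge permits, the one that yields the most new knowledge, then folds the result back into its knowledge base. The action names, preconditions, and effects below are invented for illustration and are far simpler than any real planning engine.

```python
# Toy planning loop: greedily execute the permitted action that adds
# the most new facts, updating the knowledge base after each step.

def plan_step(knowledge, actions):
    """Pick the executable action that contributes the most new facts."""
    runnable = [a for a in actions if a["requires"] <= knowledge]
    return max(runnable, key=lambda a: len(a["provides"] - knowledge), default=None)

actions = [
    {"name": "enumerate-host", "requires": {"foothold"}, "provides": {"host-info"}},
    {"name": "dump-credentials", "requires": {"host-info"}, "provides": {"creds"}},
    {"name": "exfiltrate", "requires": {"creds", "host-info"}, "provides": {"exfil"}},
]

knowledge = {"foothold"}
trace = []
while (action := plan_step(knowledge, actions)) and action["provides"] - knowledge:
    trace.append(action["name"])       # execute the chosen action
    knowledge |= action["provides"]    # observe the result, update knowledge
print(trace)  # ['enumerate-host', 'dump-credentials', 'exfiltrate']
```

The chained order (enumerate, then dump, then exfiltrate) emerges from the preconditions alone, mirroring how the planning-based strategies string actions together toward a future goal.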
A total of six agents, one for each automation technique, were included in
Applebaum's experiment as attackers. These simulated attackers leveraged
techniques from six different tactics in the MITRE ATT&CK framework [43].
The core of the MITRE ATT&CK framework is a set of high-level tactics that
describe the goals of an adversary by classifying the tactics and
techniques commonly used by adversaries. Tactics represent certain
objectives an adversary might have, such as privilege escalation or
exfiltration. Techniques represent various ways to achieve a given
objective, such as exfiltration over alternative protocols, for instance
Net/SMB or FTP. The six tactics simulated covered the following
post-compromise behavior: lateral movement, privilege escalation, exploit
execution, credential dumping, host discovery, and account discovery. The
results of each agent were measured as percentages of: endpoints on which
the adversary established a foothold, account credentials obtained, and
endpoints from which the adversary was able to execute exfiltration.
The work done by Applebaum et al. [2] revealed the possibilities of
automated decision techniques, despite the attacker having no prior
knowledge of or insight into the network. The planning-based strategies
outperformed the immediate-execution ones across all scoring methods.
However, the planning-based strategies spent much longer making each
decision than the immediate-execution ones, so there may be time-critical
scenarios that necessitate quicker decision making.
2.7.2 CALDERA
One of the open-source tools available for automated adversary emulation
is CALDERA, developed by MITRE [6]. CALDERA offers intelligent and
automated adversary emulation, intended to test the endpoint security of a
Windows system against common post-compromise adversarial techniques. It
comes with a remote access agent, a database, and server components. The
database contains predefined attacks taken from the ATT&CK framework [43],
which can be freely combined to create custom attacks. Once configured,
CALDERA offers a graphical web interface that allows operators to control
and monitor the agents, and gives visual feedback on ongoing operations.
Even without further configuration, CALDERA offers several adversary
profiles, which are collections of techniques and tactics designed to
create a specific scenario on an endpoint or network. These predefined
scenarios are limited to persistence, privilege escalation, discovery,
command and control communication using a remote access trojan (RAT), and
lateral movement. However, CALDERA is not limited to these predefined
scenarios; the database that comes with CALDERA allows operators to create
customized attack scenarios in a user-friendly manner. Through a dropdown
menu, operators can choose the desired tactics and techniques and directly
add them to the desired scenario. These tactics and techniques can be
further customized by editing the commands that are executed. On the other
hand, as CALDERA is developed for emulating post-compromise techniques,
the attack tactics related to the initial compromise are limited to spear
phishing techniques. CALDERA is thus best suited when the goal is to
emulate post-compromise behavior.
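Customized abilities in CALDERA are defined in YAML. A minimal sketch of what such an ability might look like is shown below; the structure follows CALDERA's ability format, but the id is a placeholder and the concrete values are illustrative rather than taken from any shipped profile.

```yaml
- id: 00000000-0000-0000-0000-000000000000   # placeholder UUID
  name: Identify local users
  description: Enumerate local accounts on the host
  tactic: discovery
  technique:
    attack_id: T1087.001
    name: "Account Discovery: Local Account"
  platforms:
    windows:
      psh:
        command: net user
```

The tactic and technique fields tie each executed command directly to MITRE ATT&CK, which is what later makes technique-level labeling of the resulting logs possible.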
The infrastructure of CALDERA consists of two main components: a master
server and a remote access tool. The remote access tool is initially
placed on one or multiple endpoints within the network and communicates
with the master server. The master server is responsible for the
decision-making process, constantly updating its internal knowledge
database with the information received from the remote access tool.
CALDERA uses a planning-based approach when deciding what actions to take;
a plan is developed based on the current information in its knowledge
database. From the initially compromised endpoint, CALDERA then expands
its foothold onto other endpoints by placing new remote access tools once
an endpoint has been compromised. Once the remote access tool is in place,
it starts to communicate back to the master server; by compromising new
endpoints and collecting data from these, the master server is able to
gradually expand its knowledge database and map the compromised
environment. This decision-making process was tested and explained further
by Applebaum et al. [3]. CALDERA was also tested on several small internal
networks within the authors' organization, where it was able to detect
network misconfigurations and vulnerabilities.
Results from experiments conducted by Applebaum et al. [3] validated the
intelligence component behind CALDERA, showing that the decision engine
was able to string multiple actions together to create complex attack
patterns, successfully generating attack artifacts and compromising new
endpoints as it moved through the network.
Chapter 3
Higher label granularity also allows for more specific training, e.g.,
detecting various stages of an attack or which tactic and technique are
being detected. A partially-labeled approach involves having a small
fully-labeled dataset and a larger unlabeled dataset, utilizing a
semi-supervised approach when training the system. The desired dataset is
a fully labeled dataset, preferably with high label granularity, including
both normal and malicious traffic.
3.1 Environment Overview
This section provides an overview of the developed environment and the
components used during this research. The environment consists of multiple
virtual machines (VMs) hosted on a personal machine. Vagrant [47] is
utilized in order to build the computer domain that will act as the
enterprise network, while the CALDERA server is hosted on a separate
Ubuntu VM. Hosting CALDERA on a separate server removes the possibility of
having fragments related to the operation of CALDERA within the developed
dataset.
Vagrant [47] and DetectionLab [15] were later introduced in order to
develop a more realistic configuration of an enterprise network, with
computer domains, multiple hosts, and a domain controller. The addition of
multiple hosts presents CALDERA with possible targets for lateral
movement. This environment was developed for the purpose of generating raw
logs; the labeling tool is independent of the work environment, but
depends on CALDERA performing the attack. The labeling tool is explained
in greater detail in subsection 4.2.1.
This domain was then customized by replacing the logger and Windows Event
Forwarder (WEF) server with three additional endpoint hosts to serve as
possible targets for lateral movement attacks; the final domain and
environment is illustrated in Figure 3.1. By removing these servers, we
eliminate the possibility of unwanted artifacts in the logs as a result of
the endpoints connecting and transmitting data to the WEF and logger
servers. For this experiment, the domain is limited to four endpoints, one
domain controller, and one Ubuntu server hosting CALDERA. Given the nature
of this experiment, the WEF and logger servers were deemed redundant.
3.3 Sysmon configuration
While not strictly necessary, having a Sysmon configuration tailored to
the use case is recommended, as it helps to tune and filter the logs
generated before processing them. With the default configuration, Sysmon
will quickly fill up the log files. The built-in filtering abilities of
Sysmon are part of what makes Sysmon a more sophisticated tool than the
built-in Windows audit log feature, and it would be unwise not to take
advantage of this. By restricting the log volume and removing irrelevant
data, we minimize possible strain on the system and network which may
occur as a result of large log files being transmitted and processed.
Filters are applied to Sysmon through filter conditions such as: is, is
not, contains, contains any, contains all, excludes, excludes any,
excludes all, begin with, end with, less than, more than, and image. Used
in conjunction with Sysmon's onmatch rule, which states whether or not
matching events are to be logged, administrators can create very specific
configuration files, customizable down to any information regarding the
process, e.g., hash, filename, IP, filepath, port number, etc. Two
recommended configurations of Sysmon are that of SwiftOnSecurity [41] and
the configuration from Olaf Hartong [36]. Both of these configurations
were studied and found quite similar, effective at filtering out the
noisiest background processes while maintaining coverage of known attack
vectors. The major difference in Hartong's configuration file was the
inclusion of detection based on image loads (e.g., DLL files), file
deletion/overwrite events, processes accessing other processes, and
process tampering. These were disabled by default by SwiftOnSecurity, with
the argument that they could potentially cause high system load. However,
it was found that the inclusion of these, through Hartong's configuration
file, did not cause high system load on our test-bed. This is most likely
due to the detailed filtering in place.
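The include/exclude filtering style used by both configurations can be illustrated with a minimal configuration fragment. The schema version and the concrete rules below are illustrative only, not excerpts from either configuration:

```xml
<Sysmon schemaversion="4.81">
  <EventFiltering>
    <!-- exclude: do not log process creations matching this rule -->
    <ProcessCreate onmatch="exclude">
      <Image condition="is">C:\Windows\System32\svchost.exe</Image>
    </ProcessCreate>
    <!-- include: log only network connections matching this rule -->
    <NetworkConnect onmatch="include">
      <DestinationPort condition="is">4444</DestinationPort>
    </NetworkConnect>
  </EventFiltering>
</Sysmon>
```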
3.4 Host Configuration
All Windows hosts are running Windows 10, the domain controller is running
on Windows Server 2016, and the CALDERA server is running on Ubuntu 20.10.
Within DetectionLab, Active Directory (AD) is pre-configured, and the
hosts have their Windows audit configurations set through Group Policy
(GPO) in AD, including command-line process auditing.
In addition to this, the CALDERA agent was placed on a single host to
grant CALDERA entry to the work environment. This host will act as the
initial attack vector. Only one host is initially infected in order to
simulate one compromised host in an entire network. CALDERA will attempt
to move laterally from this host in order to infect the rest of the
environment.
Figure 3.2: Deploying agent through CALDERA interface
The agent is created through the interface of CALDERA, where the user
can specify what operating system the attack target is running. CALDERA
will dynamically build and compile the agent based on the information
supplied. In order to drop the agent on the target host the following code
from CALDERA is manually executed through PowerShell on the target
machine:
$server="http://192.168.56.105:8888";
$url="$server/file/download";
$wc=New-Object System.Net.WebClient;
$wc.Headers.add("platform","windows");
$wc.Headers.add("file","sandcat.go"); $data=$wc.DownloadData($url);
$name=$wc.ResponseHeaders["Content-Disposition"].Substring(
    $wc.ResponseHeaders["Content-Disposition"].IndexOf("filename=")+9).Replace("`"","");
get-process | ? {$_.modules.filename -like "C:\Users\Public\$name.exe"} |
    stop-process -f; rm -force "C:\Users\Public\$name.exe" -ea ignore;
[io.file]::WriteAllBytes("C:\Users\Public\$name.exe",$data) | Out-Null;
Start-Process -FilePath C:\Users\Public\$name.exe -ArgumentList "-server $server -group red" -WindowStyle hidden;
Once the code has been executed, the CALDERA agent is fetched from
the CALDERA server and displayed in the agents tab of the interface.
Figure 3.5: CALDERA Adversary Profile
Some of these smaller scenarios, e.g., Figure 3.5, were executed during
the testing and development phase of the project, initially to test the
detection capabilities of Sysmon, and later during the development and
improvement of the labeling tool. Having smaller logs also made
manual inspection of the resulting logs more feasible. With CALDERA
version 4.0.0, MITRE implemented a new plugin which integrates MITRE’s
adversary emulation library [4] with CALDERA. The emulation library
contains fully developed emulation plans for multiple threat groups,
including APT29 [22].
3.5.3 APT29 Emulation
The emulation plan of APT29 is developed from publicly available sources,
describing the motivations, objectives and attributed tactics, techniques,
and procedures mapped to MITRE ATT&CK. APT29 is a threat group
attributed to the Russian Foreign Intelligence Service (SVR),
in operation since at least 2008. This group
reportedly compromised the Democratic National Committee starting in
the summer of 2015, and was named as one of the perpetrators of the
cyber espionage campaign that exploited the SolarWinds Orion platform
[46]. The emulation plan chains together techniques into a logical order
that has been observed across previous APT29 operations. The operations
are divided into two distinct scenarios, with a total of 20 steps and 79
operations conducted through CALDERA.
Scenario 1
Scenario one consists of a rapid espionage mission that focuses on gathering
and exfiltrating data, before transitioning into stealthier techniques in
order to achieve persistence, further data collection, credential access, and
finally lateral movement.
Figure 3.7: Step 2 CALDERA Operations
Figure 3.9: Step 4 CALDERA Operations
Figure 3.11: Step 6 CALDERA Operations
Step 9 - Collection
Additional tools are uploaded to the secondary victim (T1105) before
initiating a search for documents and media files on the file system (T1083,
T1119). The files are then collected (T1005), encrypted, and compressed
(T1002, T1022) into a single file (T1074) before exfiltration through the
existing C2 connection (T1041). Files associated with this access are deleted
(T1107) before moving on to the next step.
Scenario 2
Once step 10 in scenario 1 has been completed, CALDERA moves on to
scenario 2 of the emulation. This scenario consists of a stealthier and slower
approach: compromising the initial target, establishing persistence, gathering
credentials, and enumerating and compromising the entire domain.
Unlike the payload in step 1, this payload verifies that it is not executing
in a virtualized analysis environment through a series of enumeration
commands (T1497, T1082, T1033, T1016, T1057, T1083) before establishing
persistence through a Windows Registry Run key entry (T1060). Finally,
the ADS executes a PowerShell stager to create a command and control
connection over port 443.
Step 15 - Persistence Establishment
Additional persistence is established by creating a WMI event subscription
(T1084) which executes a PowerShell payload when the user signs in.
Step 19/20 - Clean up and Execution of Persistence
Clean up is performed by loading and executing the Sdelete binary (T1055)
within PowerShell, making the deleted files associated with the access
nearly unrecoverable. The emulation ends with a reboot of the infected
client, triggering the previously established persistence mechanisms.
The majority of techniques in this emulation are related to the Discovery
tactic, with 13 unique techniques leveraged across 27 individual operations,
followed by defense evasion techniques. Lateral movement
is the least represented tactic in this emulation, with only 5 operations
involving lateral movement techniques. However, the distribution of
tactics in the logs may differ from the distribution seen in figure 3.21, as
some tactics generate multiple log lines related to the same operation
while others generate fewer log lines.
• API Server - Responsible for managing the clients
Chapter 4
• Process creation with full command line for both current and parent
processes.
An alternative format is JavaScript Object Notation (JSON), a highly
readable data-interchange format that is easy for humans to read
and for machines to parse [23]. JSON has grown quite popular as a
standard format for structured logging: it is both readable and reasonably
compact, it provides a standardized way of structuring data,
and most programming languages can parse it. A standardized format is
desired in case other operating systems are added as further work
on this thesis.
4.1.2 Winlogbeat
Winlogbeat [Winlogbeat] was introduced to the development environment as
a tool for both log collection and file conversion. Winlogbeat is most
commonly used as a data shipper which sends the collected logs from
clients to a centralized log management solution. Event logs are read using
Windows APIs, filtered based on user-configured criteria, and then sent to
the configured outputs. Winlogbeat can capture event data from any event
logs running on the system, e.g., application, hardware, security, or system
events. It is part of the “Beats family” from the Elastic Stack [17].
Beats is a free and open platform for data shippers. The following are the
types of Beats from Elasticsearch:
• Auditbeat: Collection and shipper for Linux audit data, and monitors
the integrity of files.
With the scope of this thesis being Windows logs and endpoint
detection, Winlogbeat is the only Beat implemented; the addition
of other Beats is possible as further work. Packetbeat could be utilized to
collect detailed network-related data, but the Sysmon configuration
in place already monitors the relevant network connections. In this case, this is
limited to C2 traffic and data exfiltration. These events will not
go undetected, as Sysmon records them as outgoing network
connections.
A drawback with Winlogbeat is the increase in log size in comparison
to CSV log files. JSON files are generally larger than CSV files, as more
characters and elements are used to represent the same data. In addition,
multiple new fields are added to each event by Winlogbeat.
As Winlogbeat essentially is a log shipper, designed to ship logs to a centralized
log management solution, these fields are added in order to differentiate
between the different clients sending the logs and for query purposes.
A file comparison conducted on log files processed by Winlogbeat
and raw CSV log files revealed that Winlogbeat files are on average 340%
larger. However, compared to an equivalent EVTX file, the JSON files are
on average 53% smaller. The trade-off for a standardized format and ease
of processing is file size. The difference in file size can be reduced
by dropping irrelevant fields added by Winlogbeat: the drop_fields processor
in the Winlogbeat configuration file specifies which fields to drop.
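For example, a drop_fields processor entry in the Winlogbeat configuration could look like this. This is an illustrative sketch; the listed field names are assumptions, not the fields actually dropped in this work:

```yaml
processors:
  - drop_fields:
      # Illustrative Winlogbeat-added fields that are irrelevant for a
      # single-host dataset and can be dropped to reduce file size.
      fields: ["agent.ephemeral_id", "ecs.version", "host.mac"]
      ignore_missing: true
```

With ignore_missing set, events lacking one of the listed fields are passed through unchanged instead of raising an error.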
Figure 4.1: Data processing overview
4.2.1 Applying labels to logs
As part of the contribution from the research conducted in this thesis, a
tool was developed that can convert raw log files into labeled datasets
which can be used to train machine learning based intrusion detection
systems. This involved labeling each line in the log file stating whether
the action was part of a malicious or benign operation, as well as what
step in the attack chain, in terms of MITRE ATT&CK technique or tactic,
the malicious operation is related to. The efficiency of an IDS is highly
dependent on correctly labeled training data, as highlighted by Catania et
al. [7] and Davis et al. [14]; as part of quality assurance, the processed
log files were manually reviewed and analyzed. The labeling tool searches
for JSON files created by Winlogbeat from the Sysmon logs in a specified
folder where unprocessed logs are stored. These files are then processed
and the output is saved in a standardized JSON format in a separate folder
containing labeled logs. The tool leverages process relations by matching
the process identifiers (PIDs) of malicious events with PIDs found in
subsequent events, utilizing recursion when processing the log files. A
unique identifier is given to a process once it is created, thus only one
process can be directly tied to a PID. The tool was developed through three
main iterations, the final and complete version is centered around a report
that CALDERA can generate after an emulation. Data from this report is
extracted and leveraged to find the initial process of each operation, and
what ATT&CK tactic and technique the process is related to. Any process
or operation seen in relation to the initial process is considered part of the
ongoing operation and labeled accordingly.
First Iteration
In the first iteration, the labeling tool started off by defining two lists
within the program: one containing the Winlogbeat file to be processed,
and one empty list which is gradually filled with labeled logs. This was
done so that the tool could iterate through the Winlogbeat file and move
events found to be malicious or benign to the processed list once labeled. The
tool would search for an occurrence of the CALDERA agent by matching
the process name in the events with the name of the agent. This signals
the start of the malicious attack as each attack starts with a command being
executed by the agent. The naming of the agent is specified in CALDERA
and was coded into the tool. Once the executable has been detected, the
process ID (PID) is extracted from the list. This event is then moved from
the unprocessed list onto the processed list and labeled as malicious. When
the first known malicious PID was found, the tool would start searching for
any occurrences of this PID in the following events. If the PID was observed
in any relation to an event, it would be considered malicious, and any new
PIDs found in this event would be extracted and considered malicious.
Once the tool had found all events in relation to the initial PID it would
start a new iteration searching for the next known malicious PID. The list
containing unprocessed events shrinks as malicious events are removed
from it, causing each subsequent recursion to have fewer events
to search through. The remaining events are labeled as benign and moved
onto the processed list once all malicious events have been analysed.
This worked to an extent: the tool was able to find the malicious events,
but it was not able to apply labels based on what technique or tactic they
were related to. Benign events were also wrongly labeled, as the
tool did not take into consideration the timestamp of each event. A
PID used for a malicious operation thirty minutes earlier is not necessarily
conducting malicious operations at a later time. Additionally, changing
the name of the CALDERA agent would require the tool to be modified in
order to search for the new name. The tool was completely reworked due
to these issues.
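The first-iteration logic can be sketched in Python as follows. This is an illustrative sketch only: the actual tool is written in C#, and the event fields, helper name, and sample events here are invented for the example, with only the agent name sandcat.go taken from the dropper shown earlier:

```python
# Simplified sketch of the first-iteration pass: seed the malicious-PID
# set from events whose process name matches the agent, then propagate
# "malicious" to every event whose PID or parent PID is already known.

AGENT_NAME = "sandcat.go"  # agent name hard-coded, as in the first iteration

def label_events(events):
    """events: list of dicts with 'process', 'pid' and 'parent_pid' keys."""
    malicious_pids = set()
    # Seed: any event whose process name matches the agent is malicious.
    for event in events:
        if event["process"] == AGENT_NAME:
            malicious_pids.add(event["pid"])
    # Propagate until no new PIDs are found (mirrors the recursive search).
    changed = True
    while changed:
        changed = False
        for event in events:
            if event["pid"] in malicious_pids or event["parent_pid"] in malicious_pids:
                if event["pid"] not in malicious_pids:
                    malicious_pids.add(event["pid"])
                    changed = True
                event["isMalicious"] = True
    # Everything never touched by the propagation is benign.
    for event in events:
        event.setdefault("isMalicious", False)
    return events

events = [
    {"process": "sandcat.go", "pid": 100, "parent_pid": 1},
    {"process": "powershell.exe", "pid": 200, "parent_pid": 100},
    {"process": "whoami.exe", "pid": 300, "parent_pid": 200},
    {"process": "explorer.exe", "pid": 400, "parent_pid": 1},
]
labeled = label_events(events)
print([e["isMalicious"] for e in labeled])  # -> [True, True, True, False]
```

Note that this sketch reproduces the first iteration's weakness as well: it ignores timestamps, so a PID reused for benign work later would still be marked malicious.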
Second Iteration
During this thesis, MITRE released version 4.0.0 of CALDERA; this version
has the capability to produce a report with information regarding each
operation conducted during the emulation.
The second iteration of the tool leveraged this report to extract the
agent PID, the first child process of the agent for each operation, and
the ATT&CK tactic and technique related to the specific operation. With
this information available, the tool was able to apply fine-grain labels
including what ATT&CK tactic and technique each event was related to.
The logic and idea behind the first version was kept as the tool still utilized
iteration and the PIDs to find the malicious events. The tool reads the
CALDERA report, extracting information from the AgentMetaData and
AttackMetaData fields. AgentMetaData contains information regarding
the CALDERA agent, such as the PID of the process spawned by the agent
and the PID of the agent itself. The AttackMetaData field contains information
regarding the tactic and technique related to the specific PID, which is used
to apply the labels. Each operation in the CALDERA report is read as an
individual object. Once the information from the object has been extracted,
the tool starts to process the Winlogbeat file, reading each line from the
file and matching the PIDs observed with the known malicious PIDs. If
the event is found to be malicious and the PID related to the event is not
currently in the list of malicious PIDs, the tool will add this PID to the list,
which is related back to the specific operation. This event is then labeled
as malicious and the tactic and technique related to the PID is applied. In
the next line of Winlogbeat the tool will now search for both of the PIDs
within the list, repeating the process for any malicious events. Once the
tool reaches the end of the Winlogbeat file, it proceeds to read the next
object from the CALDERA report, replace the list of malicious PIDs with
the new PID, and relate the new list to a specific tactic and technique. This
process is repeated until all objects in the report have been searched for in
the Winlogbeat file.
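The per-object extraction described above can be sketched as follows. This is illustrative Python, not the C# implementation; the field names pid, agent_metadata, and attack_metadata mirror the report fields named in this chapter, while the report content itself is a made-up example:

```python
import json

# Minimal example of one operation object from a CALDERA report
# (illustrative values only).
report_json = """[
  {"agent_metadata": {"pid": 4656, "ppid": 4000},
   "attack_metadata": {"tactic": "defense-evasion",
                       "technique_id": "T1055.004",
                       "technique_name": "Process Injection: Asynchronous Procedure Call"},
   "pid": 4660}
]"""

def get_malicious_operations(report_text):
    """Extract a (pid, technique_id, technique_name) triple for every PID
    initially tied to an operation: the spawned process, the agent itself,
    and the agent's parent."""
    operations = []
    for op in json.loads(report_text):
        attack = op["attack_metadata"]
        for pid in (op["pid"], op["agent_metadata"]["pid"], op["agent_metadata"]["ppid"]):
            operations.append((pid, attack["technique_id"], attack["technique_name"]))
    return operations

ops = get_malicious_operations(report_json)
print(ops[0])  # -> (4660, 'T1055.004', 'Process Injection: Asynchronous Procedure Call')
```

Each triple seeds the malicious-PID list for one operation, so every event later traced back to one of these PIDs inherits the operation's tactic and technique label.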
Therefore it is assumed that any interactions on the environment performed
by the agent (e.g., process injection, privilege escalation, file creation/modification,
etc.) are malicious. If the PID of the agent is observed
in any relation to another process, be it a benign or malicious process, that
specific process is considered part of a malicious action. An example of
this could be the agent performing process injection by executing code in
the address space of a separate process. This can be detected through Sysmon
[48] with event ID 8 (Remote Thread Creation Detected).
The labeling tool starts off by extracting the PID, technique_name,
and technique_id found in Figure 4.3. This is placed onto a temporary
"maliciousPID" list within the labeling tool, which is later used to match
the PIDs in the list with PIDs observed in events within the Winlogbeat
file.
In Figure 4.4 we observe the resulting Sysmon event from executing the
MITRE ATT&CK technique T1055.004 - Process Injection: Asynchronous
Procedure Call [37]; CALDERA utilizes the payload 0cb710_T1055.exe in
order to execute this technique. The information found in this event is also
found in the Winlogbeat file which is being processed. The labeling tool
compares the ProcessId (4576) and ParentProcessId (4656) with the PID in
the "maliciousPID" list, and detects that the ParentProcessId (4656) is in the
"maliciousPID" list. This event is then deemed malicious, and labels are
applied as shown in Figure 4.5.
The PID 4576 has now been seen in relation to a malicious PID; as such,
this PID is considered malicious and placed onto the "maliciousPID" list.
This PID is then compared with all PIDs found in subsequent events in
order to find all malicious operations conducted and the next malicious
PID. In EventData2 of Figure 4.4, we can observe that PID 4576 spawns a
new process, ProcessId 9076, when executing the process injection payload.
Since PID 4576 is now in the "maliciousPID" list, this event is found to be
malicious and is labeled accordingly. All PIDs found in relation to the first
malicious PID will receive the same labels as the first PID.
Figure 4.6: Truncated Sysmon logs related to a process injection executed
by CALDERA
Figure 4.7: Process Tree developed during the execution of T1055.004
Once the labeling tool has found and labeled all operations related to
Figure 4.3, the tool moves on to the next operation found in the CALDERA
log, extracting the new information and repeating the labeling process.
Final Iteration
With the second iteration, the labeling tool was able to find all
processes with a PID relation and apply labels with a high granularity.
There was still an issue with some benign events being wrongly labeled as
a result of a process executing operations at a later time. This issue was
addressed by implementing a function which checks whether the malicious event
in Winlogbeat is within the time frame of the malicious operation. Each
operation in the CALDERA log has an agent delegated and finished timestamp.
These were leveraged to check whether the detected malicious event
lies between the two timestamps. In order for this to function as intended,
the internal clocks of the CALDERA server and the targeted host have to be
synchronized. However, the final event in each malicious operation had the
same timestamp as the agent finished timestamp, resulting in these events
not being correctly labeled. This was addressed by developing a method
for truncating the millisecond part of the timestamps.
The argument for this time-based approach is that each operation occurs
between the delegated and finished timestamps; the finished timestamp is
applied once the agent reports back the outcome of the operation. If an
event is found to be malicious but is outside of the expected operation
time, the label “uncertain” plus technique ID and tactic is applied in addition
to the “isMalicious: False” verdict. The false verdict is applied since there
is a higher probability that the event is benign than malicious. However,
the additional “uncertain” label indicates that the event may need to be
manually reviewed in order to determine its true nature.
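The window check with millisecond truncation can be sketched as follows. This is a Python sketch; the actual tool is written in C# and uses a TrimMilliseconds extension, which the trim_ms helper here mirrors, and the timestamps are invented for the example:

```python
from datetime import datetime

def trim_ms(ts: datetime) -> datetime:
    # Drop the sub-second part, mirroring the millisecond-truncation workaround.
    return ts.replace(microsecond=0)

def within_operation(event_ts: datetime, delegated_ts: datetime, finished_ts: datetime) -> bool:
    """True if the event falls inside the operation's delegated/finished
    window once milliseconds are truncated on both sides."""
    return trim_ms(delegated_ts) <= trim_ms(event_ts) <= trim_ms(finished_ts)

delegated = datetime(2022, 3, 1, 12, 0, 0, 120000)
finished = datetime(2022, 3, 1, 12, 0, 30, 0)
# An event sharing the finished second but carrying extra milliseconds is
# still accepted once the milliseconds are truncated.
event = datetime(2022, 3, 1, 12, 0, 30, 450000)
print(within_operation(event, delegated, finished))  # -> True
```

Without the truncation, the final event of each operation (which shares the finished second but not its sub-second part) would fall outside the window and be mislabeled.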
The tool was also adjusted in order to correctly label Command and
Control traffic as Command and Control. These events initially received the label
from the PID related to the traffic, so scenarios occurred where Command and
Control traffic was labeled as, e.g., process injection. This was addressed by
checking whether the event ID from Sysmon was three (3). Event ID 3 in Sysmon
is Network Connection; if an event is malicious with event ID 3 and within
the expected time frame, it is considered Command and Control traffic.
Manual inspection after this change showed that all Command and Control
labels were correctly applied during the emulation.
Each line of the Winlogbeat file is read and deserialized into a
Winlogbeat object, and each object is added to a list for further processing.
A catch clause is implemented to catch invalid Winlogbeat input which
would have prevented the program from running. This is seen in listing
4.2.
List<Winlogbeat> GetWinlogbeats(string path)
{
    IEnumerable<string> WinlogbeatReadLines = File.ReadLines(path);
    var Winlogbeats = new List<Winlogbeat>();
    var i = 0;
    foreach (var Winlogbeat in WinlogbeatReadLines)
    {
        try
        {
            i += 1;
            var deserializedWinlogbeat =
                JsonConvert.DeserializeObject<Winlogbeat>(Winlogbeat);
            if (deserializedWinlogbeat != null) Winlogbeats.Add(deserializedWinlogbeat);
        }
        catch (Exception e)
        {
            Console.WriteLine(e);
        }
    }
    return Winlogbeats;
}
Malicious operations are placed onto a list containing the process ID,
tactic and technique related to the malicious operation.
IEnumerable<MaliciousOperation>? GetMaliciousOperations(string path)
{
    IEnumerable<MaliciousOperation>? maliciousOperations =
        JsonConvert.DeserializeObject<List<MaliciousOperation>>(File.ReadAllText(path));
    return maliciousOperations;
}
The constructor of LogLabeler takes in the list of Winlogbeats and maliciousOperations
as arguments, as shown in listing 4.5. A new list called
MaliciousPid is created based on the list of maliciousOperations. Information
regarding the tactic, technique, and timestamps of the operations is
added to this list. In line 6, the PID of the process spawned by the agent is
added, line 9 adds the PID of the agent, while line 13 adds the parent PID
of the agent. This was done so that all PIDs initially related to a malicious
operation are added to the list of malicious PIDs. This list is later used in order
to search for new malicious PIDs and relate these back to specific tactics
and techniques.
1
2 public LogLabeler(List<Winlogbeat> Winlogbeats, IEnumerable<MaliciousOperation> maliciousOperations)
3 {
4     this.Winlogbeats = Winlogbeats;
5     var operations = maliciousOperations.Select(x => new MaliciousPid(x.pid, x.attack_metadata.technique_id,
6         x.attack_metadata.technique_name, x.delegated_timestamp, x.finished_timestamp, x.agent_metadata)).ToList();
7
8     operations.AddRange(maliciousOperations.Select(x => new MaliciousPid(x.agent_metadata.pid,
9         x.attack_metadata.technique_id, x.attack_metadata.technique_name, x.delegated_timestamp, x.finished_timestamp,
10        x.agent_metadata)).ToList());
11
12    operations.AddRange(maliciousOperations.Select(x => new MaliciousPid(x.agent_metadata.ppid,
13        x.attack_metadata.technique_id, x.attack_metadata.technique_name, x.delegated_timestamp, x.finished_timestamp,
14        x.agent_metadata)).ToList());
15
16    this.maliciousPids = operations.Distinct().ToList();
17 }
public LogLabeler FindAndMarkAllDescendantMaliciousOperations()
{
    var newMaliciousPids = new List<MaliciousPid>();
    foreach (var pid in this.maliciousPids)
    {
        if (pid.pid == null) continue;
        foreach (var Winlogbeat in Winlogbeats)
        {
            if (!Winlogbeat.MatchesMaliciousPid(pid)) continue;
            if (Winlogbeat.IsWithinMaliciousOperationTimePeriod(pid))
            {
                Winlogbeat.isMalicious = true;
                Winlogbeat.verdict = "Malicious " + pid.technique_name + " - " + pid.technique_id;
            }
            else
            {
                if (Winlogbeat.winlog.event_id == "3")
                {
                    Winlogbeat.isMalicious = true;
                    Winlogbeat.verdict = "Malicious, command and control traffic";
                }
                else if (Winlogbeat.Timestamp != null && Winlogbeat.Timestamp.Value.TrimMilliseconds() >=
                    maliciousPids.Select(x => x.agentMetadata.created).Min().TrimMilliseconds())
                {
                    Winlogbeat.verdict = "Uncertain " + pid.technique_name + " - " + pid.technique_id;
                }
            }
            newMaliciousPids.Add(new MaliciousPid(Winlogbeat?.process?.pid, pid.technique_id, pid.technique_name,
                pid.delegated_timestamp, pid.finished_timestamp, pid.agentMetadata));
            newMaliciousPids.Add(new MaliciousPid(Winlogbeat?.process?.parent?.pid, pid.technique_id, pid.technique_name,
                pid.delegated_timestamp, pid.finished_timestamp, pid.agentMetadata));
        }
    }
    this.maliciousPids = newMaliciousPids.Distinct().ToList();
    if (maliciousPids.Count > 0) FindAndMarkAllDescendantMaliciousOperations();
    return this;
}
Chapter 5
Results
completely removed from the emulation plan. Once these modifications
were made, the emulation was able to fully execute with an acceptable ratio
of successful operations, according to the CALDERA report.
Figure 5.1: Successful and failed techniques from the CALDERA report
} else {
    write-host "[!] readme.ps1 not found.";
    return 1;
}
Readme.ps1 is found and executed; the script calls multiple executables
within the modified Sysinternals Suite folder. Two of these are not found
in the folder and the execution returns an ItemNotFoundException. Since
the agent was able to execute the initial script and some data was returned,
it was considered successful in the report, while the operation was in fact
not successful. The report also does not take into consideration which
operations in the emulation plan were dropped or skipped. An accurate
number of failed and successful techniques and tactics can be found by
comparing the emulation plan with the executed operations and manually
analyzing the results of each operation.
5.2 Final Dataset
Since the dataset is converted into a standardized format, it may be
uploaded to a SIEM solution such as Splunk for further analysis. Splunk
was used to manually review the labels produced by the labeling tool. It
is expected that the labeled logs share similar traits with figure 5.2
regarding the distribution of tactics. The created APT29 dataset consists
of 2,276 events: 868 malicious and 1,406 benign.
Taking into consideration that process IDs observed outside the
operation time are not determined to be malicious, the label distribution
changes to 38.2% malicious, which is still considered biased towards
malicious logs. This is a more accurate description of the distribution, as
manually inspecting the log lines labeled as "uncertain" revealed that the
majority of these are in fact benign activities. Out of the 278 events that
received this label, roughly 30 (10.8%) were manually deemed malicious.
Most of these were related to the execution of scripts, which occurred
shortly after the time frame of the operation. Figure 5.4 is strictly based on
the labels applied by the labeling tool.
Malicious Logs
Figure 5.5 shows the distribution of techniques based on all malicious
labeled log lines. 32.7% (284/868) of the malicious labels
applied were related to discovery techniques. This is to be expected, as
the emulation successfully executed twelve discovery techniques and the
emulation plan contained a clear majority of discovery-related techniques.
While 21.2% (184/868) of the labels applied were related to
execution techniques, the emulation plan contained only one unique execution
technique. However, this technique was leveraged in multiple steps, and its
execution also resulted in a significantly larger number of events
in comparison to other techniques. T1059.001 - Command and Scripting
Interpreter usually involves the execution of scripts or payloads, which
resulted in a significantly longer chain of related process IDs compared
to some discovery techniques, which only involved executing a single
PowerShell command. T1059.001 was also used in order to prepare the
payload for other tactics.
Figure 5.5: Distribution of tactics across malicious events
Figure 5.6: Technique label distribution in dataset
5.3 Labeling Tool
The developed labeling tool is able to successfully apply labels with a high
level of granularity. Labels are applied in two distinct ways:
• Adding a data field to each line named “isMalicious:”, which can
either be true or false.
• Adding an additional data field for further specification of malicious
labels named “verdict:”. This field is enriched with the MITRE tactic
and technique ID of the malicious activities related with the event
e.g., “Verdict: Malicious System Information Discovery - T1082”.
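As an illustration, a labeled event carrying these two fields might look as follows. This is a hypothetical, heavily truncated event constructed in Python; only isMalicious and verdict come from the description above, the surrounding field names are an assumed subset of a Winlogbeat/Sysmon event:

```python
import json

# Hypothetical, heavily truncated labeled event. "isMalicious" and
# "verdict" are the fields added by the labeling tool; the rest is an
# illustrative subset of a Winlogbeat-shipped Sysmon event.
labeled_event = {
    "winlog": {"event_id": "1"},                        # Sysmon: process creation
    "process": {"pid": 4576, "parent": {"pid": 4656}},
    "isMalicious": True,
    "verdict": "Malicious System Information Discovery - T1082",
}

# The tool emits one such JSON object per log line.
print(json.dumps(labeled_event))
```

A benign event would instead carry "isMalicious": false, and an event outside its operation's time window would keep the false flag while its verdict field receives the "uncertain" prefix.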
Labels applied are directly linked to tactics and techniques found within
MITRE ATT&CK, which can be utilized to develop a labeled dataset
containing all the steps within a kill chain. This could prove valuable
when analyzing and detecting a kill chain: individual phases of the kill
chain can be simulated, labeled, and later merged with the other phases in
order to develop a complete dataset. The labeling tool successfully found
all process identifiers (PIDs) related in any manner to the malicious PIDs
extracted from the CALDERA logs. This holds as long as there is a relation
between the PIDs, such as interactions, child/parent/target relations, or
ProcessAccess, to name a few. In the scenario where a PID does not have a
relation to a malicious PID, or the PID was not captured, it will be labeled
as benign. In order to verify the labeling process, multiple other scenarios
were executed with minimal background and benign activity. The time
window of log collection was also adjusted to only collect logs during
the execution of these scenarios. Verifying the labels on a smaller dataset
proved to be a more feasible task. Sysmon was also closely monitored
during these trial emulations. Within a time span of roughly thirty minutes,
Sysmon had generated 172 events, of which the labeling tool detected 159
as malicious. No suspicious events were labeled as benign; the benign
events found were related to the host sending DNS queries to the domain
controller, which occurred even when the host was idle ahead of time.
In figure 5.7 we can observe labels related to Exfiltration tactics; as
previously stated, these were skipped in the APT29 emulation plan, but
CALDERA is clearly capable of executing this tactic as well. Evaluating the
tool against other scenarios also verifies that the tool is not directly tied to
the APT29 emulation plan.
Drawbacks
The labeling tool is developed around CALDERA and will not function
properly without the CALDERA report. This is because the labeling tool
uses data from the CALDERA report in order to find the initial malicious
PIDs and relate these back to a specific tactic and technique. For the tool
to function without the CALDERA report, it would have to be reverted
to the previous version, which searched for the process name of the
CALDERA agent in order to extract the malicious PIDs. A time-based
approach for applying higher label granularity could substitute for the
tactics and techniques extracted from the report, by knowing exactly what
time frame each technique was executed in and labeling malicious
events accordingly.
The tool was not able to label the logs perfectly, as it relies on process
relations, and some events could not be traced back to a malicious PID
or had a blank parent PID. In these scenarios the tool wrongly labels
the events as benign. However, this is a rare occurrence, and only a
fraction of events were found to be potentially mislabeled as benign after
the last development iteration of the tool. The APT29 dataset also contains
benign events mislabeled as “uncertain” because the process performed
benign operations at a later time. With a shorter emulation, such as the trial
emulation, this drawback did not have a significant impact. The majority of
the benign events labeled as “uncertain” were related back to T1059.001 -
Command and Scripting Interpreter, with the PID 2388. During a malicious
operation, a malicious PID was observed targeting 2388, which resulted in
this PID being added to the malicious PID list. PID 2388 was later used by the
GHOSTS framework to produce benign events. This PID was observed
outside the expected time frame and is given the “isMalicious: false” data
field; however, since the PID had previously been considered malicious, the
“verdict” data field receives the “uncertain” label.
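The verdict logic described above can be sketched as follows. The function, event fields, and time representation are an illustrative simplification, not the tool's actual implementation; only the “isMalicious” and “verdict” field names mirror the dataset.

```python
from datetime import datetime

def verdict(event_time, pid, op_start, op_end, known_malicious_pids):
    """Assign the 'isMalicious' and 'verdict' fields: an event is
    malicious only if its PID is tied to an operation AND it falls
    inside that operation's time frame; a formerly malicious PID seen
    outside the window is flagged 'uncertain' instead of benign."""
    in_window = op_start <= event_time <= op_end
    if pid in known_malicious_pids and in_window:
        return {"isMalicious": True, "verdict": "malicious"}
    if pid in known_malicious_pids:
        # PID was malicious earlier but is now outside the time frame,
        # e.g. reused by GHOSTS for benign activity (the PID 2388 case).
        return {"isMalicious": False, "verdict": "uncertain"}
    return {"isMalicious": False, "verdict": "benign"}

start = datetime(2022, 5, 1, 10, 0)
end = datetime(2022, 5, 1, 10, 2)
late = datetime(2022, 5, 1, 11, 30)
print(verdict(late, 2388, start, end, {2388})["verdict"])  # uncertain
```

This separation lets downstream users decide whether to treat “uncertain” events as benign, malicious, or discard them entirely.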
Despite these drawbacks, the tool was able to successfully label the
majority of malicious events with accuracy and high label granularity.
Not all malicious events were found: roughly 30 of the 278 events that
received the "uncertain" label were later found to be malicious. These
events occurred close to the expected time frame and were mostly related
to the execution of scripts. The tool is considered a viable approach for
labeling logs, but refinement is necessary to achieve a perfect score.
Chapter 6
This chapter discusses the final result of the research and compares it with
the framework presented by Gharib et al. [18] in Chapter 3.
malicious events during the actual simulation. This effectively removes the
possibility of misclassifying benign events related to background
processes and other operations that may have occurred on the host during
the development of the attack dictionary. Malicious events are instead
found by leveraging the process relations of events to effectively map all
processes spawned or affected in any way by the malicious operation. The
granularity of the applied labels has also been improved: the labels are
fine-grain and directly tied to specific tactics and techniques within
MITRE ATT&CK. Both labeling approaches are conducted automatically
by the developed tool. The only manual operations necessary when
applying labels are the extraction of the CALDERA report and of the
unprocessed logs. The process of generating attack data is also automated
through CALDERA; the attacker only has to select the desired attack
scenario or emulation plan. The labeling tool is not limited to the chosen
emulation plan, which allows it to be utilized in combination with a
modular approach, where multiple smaller attack scenarios can be
executed and labeled. This allows researchers to specifically select the
attack data relevant to their subject, or merge these smaller datasets into a
complete dataset. It is also possible to quickly create
new datasets once CALDERA releases new APT emulation plans. Being
able to quickly develop labeled datasets containing new attacks as these
are discovered and implemented into the ATT&CK framework is a valuable
asset.
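The process-relation mapping described above can be sketched as a breadth-first traversal over Sysmon process-creation events (Event ID 1). The simplified event fields are illustrative, not the raw Sysmon schema; the example PIDs (2344 spawning 7552, which spawns 968) are taken from the appendix.

```python
from collections import deque

def expand_malicious_pids(events, initial_pids):
    """Follow parent->child process relations in Sysmon process-creation
    events to collect every PID spawned, directly or transitively, by
    an initially malicious PID from the CALDERA report."""
    # Build a parent PID -> child PIDs adjacency map
    children = {}
    for ev in events:
        if ev.get("event_id") == 1:  # Sysmon ProcessCreate
            children.setdefault(ev["parent_pid"], []).append(ev["pid"])
    # Breadth-first traversal from the initially malicious PIDs
    malicious = set(initial_pids)
    queue = deque(initial_pids)
    while queue:
        pid = queue.popleft()
        for child in children.get(pid, []):
            if child not in malicious:
                malicious.add(child)
                queue.append(child)
    return malicious

events = [
    {"event_id": 1, "parent_pid": 2344, "pid": 7552},
    {"event_id": 1, "parent_pid": 7552, "pid": 968},
]
print(sorted(expand_malicious_pids(events, {2344})))  # [968, 2344, 7552]
```

Note that this traversal alone is what makes the approach sensitive to broken process chains: a missing or blank parent PID disconnects a subtree from the malicious set.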
The decision to skip certain operations affects the attack diversity of the
dataset, which could cause biases in systems trained on the dataset. However,
as the emulation successfully executed several other tactics, as can
be seen from the labels applied in Figure 5.6, the dataset created in this
thesis is considered to have an adequate attack diversity. In addition, since
the emulation plan chosen in this thesis involves the execution of several
steps, the dataset is beneficial for the research around multistep attacks,
where the currently available datasets are rare [34]. It is also acknowledged
that the sample size of the dataset is low: 2,276 events, of which 868 are
malicious and 1,406 are benign. Data collection was ended shortly
after CALDERA had finished the emulation. Collecting the logs at a later
time would result in a larger sample of benign events due to the approach
of only labeling events as malicious if they occurred within the time frame
of an operation; all events occurring after the emulation would be found
benign, as all operations had ended.
Myneni et al. [33] presented an APT-focused dataset, DAPT 2020,
covering different attack vectors related to the later stages of an advanced
attack, e.g., Privilege Escalation, Collection, and Exfiltration. The dataset
includes system event logs, MySQL access logs, Audit host IDS logs,
Apache access logs, authentication logs, logs from various services, and
DNS logs. The inclusion of multiple log sources makes DAPT 2020
heterogeneous, as the samples have different traits. Myneni et al. generated
benign traffic by having regular users perform what is considered routine
business operations throughout a week. This hinders the reproducibility of
DAPT 2020, as the benign traffic within the dataset is almost impossible to
reproduce without detailed insight into exactly what operations were
conducted and when. Additionally, the data in DAPT 2020 is not labeled, as
the dataset was tested on a semi-supervised IDS trained on benign data.
The dataset presented in this thesis may be considered homogeneous
in comparison to the DAPT 2020 dataset, as it only contains
host-based logs. However, even as purely host-based data, it is
made heterogeneous through the inclusion of logs regarding file systems,
certificates, processes, and call traces. Further, the presented dataset is
reproducible, as both benign and malicious traffic is generated through
automation software, which can be easily replicated. The dataset can also
be used and tested on supervised intrusion detection systems, as it includes
fine-grain labels. However, as shown in Section 5.1, the presented dataset
is missing data related to the Persistence, Lateral Movement, and Exfiltration
tactics, all of which are important aspects of an APT attack.
Gharib et al. [18] presented eleven features necessary for a comprehensive
framework for generating IDS/IPS datasets. Although
the datasets discussed are network based, some of the presented criteria
can be applied to host-based datasets. The characteristics derived from the
work presented by Gharib et al. are explained in detail in Chapter 3.
6.3 Limitations
As with any study exploring new fields, this study is subject to
limitations. The presented automated labeling tool is currently limited
to CALDERA, leveraging the report generated by CALDERA after an
emulation in order to both find and label malicious operations in the raw
dataset. The metadata within this report provides the labeling tool with
the initial malicious process ID, time frame, and tactic/technique of each
operation. The tool is not limited to the testbed designed in this thesis and
should work on Sysmon logs in any environment where CALDERA is used
to generate attack data. If labels were applied solely based on process
relations, the implemented algorithm for finding malicious events would
be prone to misclassification due to the nature of interleaving processes.
This was addressed by implementing a time-based labeling approach,
which only labels events as malicious if they fall within the expected time
frame of an operation. Benign events that happen to fall within such a
window may still be misclassified, as certain operations have a longer time
frame of two to three minutes. This was not observed during the manual
analysis of the presented dataset, but it remains a possibility. If the process
chain is broken by, e.g., Process Spoofing, malformed data, or a missing
parent PID, the subsequent malicious events will go undetected. Process
Spoofing was not tested during any of the conducted experiments and is not a part of
the APT29 emulation plan. Fine-grain labels are limited to the tactics and
techniques within the ATT&CK Matrix, and are only applied to events
found malicious. Regarding the labeling of benign events, the tool is
not able to differentiate between background events and normal user-generated
events. This was not prioritized, as the focus of this thesis was to research
the possibilities of generating labeled APT attack data with the help of
CALDERA. The developed labeling tool is also limited to host logs, as the
implemented algorithm does not work on transmitted network data.
The presented APT29 dataset contains 32 fine-grain attack technique
labels, distributed across 8 known tactics. It is acknowledged that the
dataset is incomplete as it does not contain all of the expected operations
within the APT29 emulation plan. Further, it is acknowledged that
the dataset is imbalanced, with a bias toward malicious events. The
distribution of 65.8% benign and 35.2% malicious events is shown in Figure
5.4. Ideally, only a small portion of the dataset should be malicious, as this
is representative of realistic datasets. The time period of the emulation is
considered unrealistic in comparison with real APT attacks. CALDERA
finished the emulation within two hours, while a realistic APT attack is
executed over a longer time period. In an ideal case, the emulation would
be divided and distributed across several weeks with a slower approach;
this was not performed due to time constraints. The presented dataset
can be considered a proof-of-concept for the developed labeling tool and
technique.
Chapter 7
niques within MITRE ATT&CK, resulting in attack labels of high granularity.
The resulting dataset can be used for kill chain or multistep attack
detection, as it can apply labels related to the various attack phases.
The implemented environment presents a use case for combining
CALDERA and Sysmon in order to generate raw APT logs. Benign
traffic is introduced through the General HOSTS (GHOSTS) framework
in an attempt to diversify the resulting dataset while still maintaining
the reproducibility of the generated dataset. The resulting benign data
was, however, not satisfactory, as the automated actions were mainly web
browsing activities.
In conclusion, the research conducted in this thesis demonstrates the
possibilities of using CALDERA and Sysmon to generate raw attack logs.
It has introduced a new labeling tool and approach for applying fine-
grain attack labels to the raw logs, converting them into a fully labeled
dataset. An APT29 dataset is presented; while incomplete and imbalanced,
it provides a proof-of-concept for the presented labeling tool
and technique. The environment created in this thesis is capable of
producing fully labeled attack datasets once new attacks are discovered and
implemented into CALDERA and the ATT&CK framework.
Another avenue for future work is to improve the emulation from
CALDERA in order to fully emulate all of the steps within the emulation
plan. The labeling tool should also be tested on other APT plans once these
are implemented into CALDERA. It is also recognized that in order to fully
evaluate the presented APT29 dataset, a machine learning model should be
trained on the dataset and tested. An interesting topic to further improve
this research area could be to replicate the study while capturing both host
and network logs. A study similar to the one presented in this thesis, but
regarding a network dataset, was conducted by Julie L. Gjerstad [20]. The
network dataset presented by Gjerstad could be merged with the APT29 host
dataset in order to create a complete dataset containing both network and
host data.
Appendices
.1 Labeled logs related to T1134.002
PID 2344 - DLL loading and file created
PID 2344 - Changes to Certificates
PID 2344 - Deletes previously created file, registry changes
PID 2344 - Changes to registry.
PID 2344 - Spawns 7552 which in return spawns 968.
Bibliography
[13] Robert K. Cunningham et al. Evaluating intrusion detection systems
without attacking your friends: The 1998 DARPA intrusion detection evaluation.
Tech. rep. Massachusetts Institute of Technology, Lincoln Laboratory,
Lexington, 1999.
[14] Jonathan J Davis and Andrew J Clark. ‘Data preprocessing for
anomaly based network intrusion detection: A review’. In: computers
& security 30.6-7 (2011), pp. 353–375.
[15] DetectionLab. Retrieved from https://detectionlab.network/.
[16] Thomas Edgar and David Manz. Research methods for cyber security.
Syngress, 2017.
[17] ElasticSearch Stack. Beats. Retrieved from https://www.elastic.co/beats/.
[18] Amirhossein Gharib et al. ‘An evaluation framework for intrusion
detection dataset’. In: 2016 International Conference on Information
Science and Security (ICISS). IEEE. 2016, pp. 1–6.
[19] GHOSTS Timeline Repository, retrieved from https://github.com/cmu-
sei/GHOSTS/tree/master/src/Ghosts.Client/Sample%20Timelines.
[20] Julie Lidahl Gjerstad. Generating labelled network datasets of APT with
the MITRE CALDERA framework. University of Oslo. 2022.
[21] Waqas Haider et al. ‘Windows based data sets for evaluation of
robustness of host based intrusion detection systems (IDS) to zero-
day and stealth attacks’. In: Future Internet 8.3 (2016), p. 29.
[22] https://attack.mitre.org/groups/G0016/.
[23] JavaScript Object Notation (JSON). Retrieved from https://www.json.org/json-
en.html.
[24] Fikret Kadiric. MasterThesis-LogLabeler. Version 1.0.0. May 2022. URL:
https://github.uio.no/fikretk/MasterThesis-LogLabeler/tree/main.
[25] Fikret Kadiric. MasterThesis-LogLabeler. Version 1.0.0. May 2022. URL:
https://github.com/Fiik/MasterThesis-LogLabeler.
[26] Kevin S Killourhy and Roy A Maxion. ‘Toward realistic and artifact-
free insider-threat data’. In: Twenty-Third Annual Computer Security
Applications Conference (ACSAC 2007). IEEE. 2007, pp. 87–96.
[27] Robert Koch, Mario Golling and Gabi Dreo Rodosek. ‘Towards
comparability of intrusion detection systems: New data sets’. In:
TERENA Networking Conference. Vol. 7. 2014.
[28] Max Landauer et al. ‘Have it Your Way: Generating Customized
Log Datasets With a Model-Driven Simulation Testbed’. In: IEEE
Transactions on Reliability 70.1 (2020), pp. 402–415.
[29] Tien-Chih Lin, Cheng-Chung Guo and Chu-Sing Yang. ‘Detecting
Advanced Persistent Threat Malware Using Machine Learning-
Based Threat Hunting’. In: European Conference on Cyber Warfare and
Security. Academic Conferences International Limited. 2019, pp. 760–
XX.
[30] Vasileios Mavroeidis and Audun Jøsang. ‘Data-driven threat hunting
using sysmon’. In: Proceedings of the 2nd International Conference on
Cryptography, Security and Privacy. 2018, pp. 82–88.
[31] Michael Haag. Resources for learning about deploying, managing and hunt-
ing with Microsoft Sysmon. Retrieved from https://github.com/MHaggis/sysmon-
dfir.
[32] Microsoft. Audit Policy Recommendations.
Retrieved from https://docs.microsoft.com/en-us/windows-server/identity/ad-
ds/plan/security-best-practices/audit-policy-recommendations.
[33] Sowmya Myneni et al. ‘Dapt 2020-constructing a benchmark dataset
for advanced persistent threats’. In: International Workshop on Deploy-
able Machine Learning for Security Defense. Springer. 2020, pp. 138–163.
[34] Julio Navarro, Aline Deruyver and Pierre Parrend. ‘A systematic
survey on multi-step attack detection’. In: Computers & Security 76
(2018), pp. 214–249.
[35] Joshua Ojo Nehinbe. ‘A critical evaluation of datasets for investigat-
ing IDSs and IPSs researches’. In: 2011 IEEE 10th International Confer-
ence on Cybernetic Intelligent Systems (CIS). IEEE. 2011, pp. 92–97.
[36] Olaf Hartong. Sysmon Configuration File.
Retrieved from https://github.com/olafhartong/sysmon-modular.
[37] Process Injection: Asynchronous Procedure Call,
Retrieved from https://attack.mitre.org/techniques/T1055/004/.
[38] Rapid7. Metasploit Penetration Testing Software.
Retrieved from http://www.metasploit.com.
[39] Saurabh Singh et al. ‘A comprehensive study on APT attacks
and countermeasures for future networks and communications:
challenges and solutions’. In: The Journal of Supercomputing 75.8
(2019), pp. 4543–4574.
[40] Branka Stojanović, Katharina Hofer-Schmitz and Ulrike Kleb. ‘APT
datasets and attack modeling for automated detection methods: A
review’. In: Computers & Security 92 (2020), p. 101734.
[41] SwiftOnSecurity Sysmon Configuration File,
Retrieved from https://github.com/SwiftOnSecurity/sysmon-config.
[42] The MITRE Corporation. "Adversary Emulation Plans," MITRE ATT&CK.
Retrieved from https://attack.mitre.org/resources/adversary-emulation-plans.
[43] The MITRE Corporation. "MITRE ATT&CK Framework". Retrieved from
https://attack.mitre.org/.
[44] Dustin D. Updyke et al. Ghosts in the machine: A framework for
cyber-warfare exercise NPC simulation. Tech. rep. Carnegie Mellon
University, Pittsburgh, PA, 2018.
[46] US Government, APT29 ties to SolarWinds.
Retrieved from https://www.whitehouse.gov/briefing-room/statements-releases
/2021/ 04/15/fact-sheet-imposing-costs-for-harmful-foreign-activities-by-the-
russian-government/.
[47] Vagrant, Retrieved from https://www.vagrantup.com/docs.
[48] Windows Sysinternals Suite. System Monitor.
Retrieved from https://docs.microsoft.com/en-us/sysinternals/downloads/sysmon.