
Version of Record: https://www.sciencedirect.com/science/article/pii/S0167404819302020

A Dynamic Games Approach to Proactive Defense Strategies against Advanced Persistent Threats in Cyber-Physical Systems

Linan Huang, Quanyan Zhu
Department of Electrical and Computer Engineering, New York University, Brooklyn, NY, 11201

Keywords: advanced persistent threats; defense in depth; proactive defense; industrial control system security; cyber deception; multi-stage Bayesian game; perfect Bayesian Nash equilibrium; Tennessee Eastman process

Abstract: Advanced Persistent Threats (APTs) have recently emerged as a significant security challenge for cyber-physical systems due to their stealthy, dynamic and adaptive nature. Proactive dynamic defenses provide a strategic and holistic security mechanism to increase the costs of attacks and mitigate the risks. This work proposes a dynamic game framework to model the long-term interaction between a stealthy attacker and a proactive defender. The stealthy and deceptive behaviors are captured by a multi-stage game of incomplete information, where each player has his own private information unknown to the other. Both players act strategically according to their beliefs, which are formed by multi-stage observations and learning. The perfect Bayesian Nash equilibrium provides a useful prediction of both players' policies because no player benefits from unilateral deviations from the equilibrium. We propose an iterative algorithm to compute the perfect Bayesian Nash equilibrium and use the Tennessee Eastman process as a benchmark case study. Our numerical experiments corroborate the analytical results and provide further insights into the design of proactive defense-in-depth strategies.

1. Introduction
The recent advances in automation technologies, 5G networks, and cloud services have accelerated the development
of cyber-physical systems (CPSs) by integrating computing and communication functionalities with components in
the physical world. Cyber integration increases the operational efficiency of the physical system, yet it also creates
additional security vulnerabilities. First, the increased connectivity and openness have expanded the attack surface
and enabled attackers to leverage vulnerabilities from multiple system components to launch a sequence of stealthy
attacks. Second, the component heterogeneity, the functionality complexity, and the dimensionality of cyber-physical
systems have created many zero-day vulnerabilities, which make the defense arduous and costly.
Advanced Persistent Threats (APTs) are a class of emerging threats for cyber-physical systems with the following
distinct features. Unlike opportunistic attackers who spray and pray, APTs have specific targets and sufficient knowl-
edge of the system architecture, valuable assets, and even defense strategies. Attackers can tailor their strategies and
invalidate cryptography, firewalls, and intrusion detection systems. Unlike myopic attackers who smash and grab,
APTs are stealthy and can disguise themselves as legitimate users for a long sojourn in the victim’s system.
A few security researchers and experts have proposed APT models in which the entire intrusion process is divided
into a sequence of phases, such as Lockheed-Martin’s Cyber Kill Chain (see Hutchins, Cloppert and Amin (2011)),
MITRE’s ATT&CK (see Corporation (2019)), the NSA/CSS technical cyber threat framework (see Department of
Homeland Security (2018)), and the ones surveyed in Messaoud, Guennoun, Wahbi and Sadik (2016). Fig. 1 illustrates
the multi-stage structure of APTs. During the reconnaissance phase, a threat actor collects open-source or internal
intelligence to identify valuable targets. After the attacker obtains a private key and establishes a foothold, he escalates
privilege, propagates laterally in the cyber network, and eventually either accesses confidential information or inflicts
physical damage. Static standalone defense on a physical system cannot deter attacks originating from a cyber network.
The multi-phase feature of APTs results in the concept of Defense in Depth (DiD), i.e., multi-stage cross-layer
defense policies. A system defender should adopt defensive countermeasures across the phases of APTs and holistically
consider interconnections and interdependencies among these layers. To formally describe the interaction between an
APT attacker and a defender with the defense-in-depth strategy, we map the sequential phases of APTs into a game of
multiple stages. Each stage describes a local interaction between the attacker and the defender where the outcome leads
Figure 1: An illustrative example of the multi-stage structure of APTs. The multi-stage attack is composed of reconnaissance, initial compromise, privilege escalation, lateral movement, and mission execution. An attack originating from an early-stage cyber network can lead to damage in a physical system.

to the next stage of interactions. The goal of the attacker is to stealthily reach the targeted physical or informational
assets while the defender aims to take defensive actions at multiple phases to thwart the attack or reduce its impact.
Detecting APTs in a timely fashion (i.e., before attackers reach the final stage) and effectively (i.e., with low rates of false alarms and missed detections) remains an open problem due to their stealthy and deceptive characteristics. As reported in LLC (2018), US companies in 2018 took an average of 197 days to detect a data breach and 69 days to contain it. Stuxnet-like APT attacks can conceal themselves in a critical industrial system for years
and inconspicuously increase the failure probability of physical components. Due to the insufficiency of timely and
effective detection systems for APTs, the defender remains uncertain about the user’s type, i.e., either legitimate or
adversarial, throughout stages. To prepare for the potential APT attacks, the defender needs to adopt precautions and
proactive defense measures, which may also impair the user experience and reduce the utility of a legitimate user.
Therefore, the defender needs to strategically balance the tradeoff between security and usability when the user’s type
remains private.
In this work, we model the private information of the user’s type as a random variable following the work of
Harsanyi (1967). Under the same defense action, the behavior and the utility of a user depend on whether his type is
legitimate or adversarial. To make secure and usable decisions under incomplete information, the defender forms a
belief on the user’s type and updates the belief via the Bayesian rule based on the information acquired at each stage.
For example, throughout the phases of an APT, detection systems can generate many alerts based on suspicious user
activities. Although these alerts do not directly reveal the user’s type, a defender can use them to reduce the uncertainty
on the user’s type and better determine her defense-in-depth strategies at multiple stages.
Defensive deception provides an alternative perspective to bring uncertainty to the attacker and tilt the information
asymmetry. We classify a defender into different levels of sophistication based on factors such as her level of security
awareness, the detection techniques she has adopted, and the completeness of her virus signature database. A sophisti-
cated defender has a higher success rate of detecting adversarial behaviors. Thus, the behavior of an attacker depends
on the type of defender that he interacts with. For example, the attacker may remain stealthy when he interacts with a
sophisticated defender but behaves more aggressively when interacting with a primitive defender. As the attacker has
incomplete information regarding the defender's type, he needs to form a belief and continuously update it based on
his observation of the defender’s actions. In this way, the attacker can optimally decide whether, when, and to what
extent, to behave aggressively or conservatively.
To this end, we also use a random variable to characterize the private information of the defender’s type. As both
players have incomplete information regarding the other player’s type and they make sequential decisions across multi-
ple stages, we extend the classical static Bayesian game to a multi-stage nonzero-sum game with two-sided incomplete
information. Both players act strategically according to their beliefs to maximize their utilities. The Perfect Bayesian
Nash Equilibrium (PBNE) provides a useful prediction of their policies at every stage for each type since no players can
benefit from unilateral deviations at the equilibrium. Computing the PBNE is challenging due to the coupling between
the forward belief update and the backward policy computation. We first formulate a mathematical programming
problem to compute the equilibrium policy pair under a given belief for the one-stage Bayesian game. For multi-stage
Bayesian games, we compute the equilibrium policy pair under a given sequence of beliefs by constructing a sequence


of nested mathematical programming problems. Finally, we combine these programs with the Bayesian update and
propose an efficient algorithm to compute the PBNE.
The proposed modeling and computational methods are shown to be capable of hardening the security of a broad
class of supervisory control and data acquisition (SCADA) systems. This work leverages the Tennessee Eastman pro-
cess as a case study of proactive defense-in-depth strategies against the APT attackers who can infiltrate into the cyber
network through phishing emails, escalate privileges through the process injection, tamper the sensor reading through
malicious encrypted communication, and eventually decrease the operational efficiency of the Tennessee Eastman
process without triggering the alarm. The dynamic game approach offers a quantitative way to assess the risks and
provides a systematic and computational mechanism to develop proactive and strategic defenses across multiple cyber
and physical stages. Based on the computation result of the case study, we obtain the following insights to guide the
design of practical defense systems.

• Defense at the final stage is usually too late to be effective when the APT attacker is well prepared and ready to attack. We need to take precautions and proactive responses in the cyber stages while the attack remains "under the radar", so that the attacker is less dominant by the time he reaches the final stage.
• The online learning capability of the defender plays an important role in detecting the adversarial deception and
tilting the information asymmetry. It increases the probability of identifying the hidden information from the
observable behaviors, compels the stealthy attacker to take more conservative actions, and hence reduces the
attack loss.
• Defensive deception techniques are shown to be effective in introducing uncertainty to attackers, increasing their learning costs, and hence reducing the probability of successful attacks. These techniques may negatively affect legitimate users; however, a delicate balance between security and usability can be achieved under proper designs.

1.1. Related Work


One well-known industrial solution to APT defense is the ATT&CK matrix (see Corporation (2019)). It illustrates
disclosed attack methods and possible detection and mitigation countermeasures at different phases of APTs. However,
as argued in Dufresne (2018), listing all possible attack methods in one matrix lacks prioritization. Many false alarms can arise because legitimate users can also generate a majority of the activities listed in the ATT&CK matrix. Besides, despite persistent updates, the matrix is far from complete and can lead to missed detections.
Many papers have attempted to deal with the above two challenges, i.e., false alarms and missed detections. To prevent security specialists from being overwhelmed by alarms, Marchetti, Pierazzi, Colajanni and Guido (2016) has analyzed
high volumes of network traffic to reveal weak signals of suspect APT activities and ranked these signals based on
the computation of suspiciousness scores. To identify attacks that exploit zero-day vulnerabilities or other unknown
attack techniques, Friedberg, Skopik, Settanni and Fiedler (2015) has managed to learn and maintain a white-list of
normal system behaviors and report all actions that are not on the white-list. There is also a rich literature on detecting
essential components of an APT attack such as malicious PDF files in phishing emails (see Nissim, Cohen, Glezer
and Elovici (2015)), malicious SSL certificate during command and control communications (see Ghafir, Prenosil,
Hammoudeh, Han and Raza (2017)), and data leakage at the final stage of the APT campaign (see Sigholm and Bang
(2013)). These works have focused on a static detection of abnormal behaviors in one specific stage but have not taken
into account the correlation among multiple phases of APTs. Ghafir, Hammoudeh, Prenosil, Han, Hegarty, Rabie and
Aparicio-Navarro (2018) has managed to build a framework to correlate alerts across multiple phases of APTs based on
machine learning techniques so that all those alerts can be attributed to a single APT scenario. Ghafir, Kyriakopoulos,
Lambotharan, Aparicio-Navarro, AsSadhan, BinSalleeh and Diab (2019) has constructed a correlation framework to
link elementary alerts to the same APT campaign and applied the hidden Markov model to determine the most likely
sequence of APT stages.
An alternative perspective from the aforementioned APT detection frameworks is to address how to respond to and
mitigate potential attacks. Li, Yang, Xiong, Wen and Tang (2018) has captured the dynamic state evolution through a
network-based epidemic model and provided both prevention and recovery strategies for defenders based on optimal
control approaches. Since APTs are controlled by human experts and can act strategically, the defender’s response
should adapt to the potential change of APT behaviors. Thus, decision and game theory becomes a natural quantitative
framework to capture constraints on defense actions, attack consequences, and attackers’ incentives. Van Dijk, Juels,


Oprea and Rivest (2013) has proposed FlipIt game to model the key leakage under APTs as a private takeover between
the system operator and the attacker. Many works have integrated FlipIt with other components for the APT defense
such as the signaling game to defend cloud service (see Pawlick, Chen and Zhu (2018)), an additional player to model
the insider threats (see Feng, Zheng, Hu, Cansever and Mohapatra (2015)), and a system of multiple nodes under limited
resources (see Zhang, Zheng and Shroff (2015)). The FlipIt game provides a high-level abstraction of the attacker's
behavior to understand optimal timing for resource allocations. However, for our purpose of developing multi-stage
defense policies, we need to provide a finer-grained model that can capture the dynamic interactions between players of
different types across multiple stages. Our game framework models heterogeneous adversarial and defensive behaviors
at multiple stages, allowing the prediction of attack moves and the estimation of losses using the equilibrium analysis.
Other security game models such as Zhu and Rass (2018); Yang, Li, Zhang, Yang, Xiang and Zhou (2018); Huang,
Chen and Zhu (2017) have provided dynamic risk management frameworks that allow the defender to respond and
repair effectively. In particular, to model the multi-stage structure of APTs, Zhu and Rass (2018) has developed a
sequence of heterogeneous game phases, i.e., a static Bayesian game for spear phishing, a nested game for penetration,
and a finite zero-sum game for the final stage of physical-layer infrastructure protection. However, most of these
security game frameworks have assumed complete information. Our framework explicitly models the incomplete
information across the entire phases of APTs and introduces their belief updates based on multi-stage information for
making long-term strategic decisions.
Cyber deception is an emerging research area. Games of incomplete information are natural frameworks to model
the uncertainty and misinformation introduced by cyber deceptions. Previous works mainly focus on adversarial de-
ceptions where the deceiver is the attacker. For example, strategic attackers in Nguyen, Wang, Sinha and Wellman
manipulate the attack data to mislead the defender in finitely repeated security games. A defender, on the other hand,
can also initiate defensive deception techniques such as perturbations via external noises, obfuscations via revealing
useless information, or honeypot deployments as shown in Pawlick, Colbert and Zhu (2017). Horák, Zhu and Bošanskỳ
(2017) proposes a framework to engage with attackers strategically to deceive them against the attack goal without their
awareness. A honeypot which appears to contain valuable information can lure attackers into isolation and surveil-
lance. La, Quek, Lee, Jin and Zhu (2016) has used a Bayesian game to model deceptive attacks and defenses in a
honeypot-enabled network in the envisioned Internet of Things. Besides detection, a honeypot can also be used to ob-
tain high-level indicators of compromise under a proper engagement policy as shown in Huang and Zhu (2019a) where
several security metrics are investigated and the optimal engagement policy is learned by reinforcement learning. A
system can also disguise a real asset as a honeypot to evade attacks as shown in Rowe, Custy and Duong (2007). Our
work considers a dynamic Bayesian game with double-sided incomplete information to incorporate both adversarial
and defensive deceptions.
The preliminary versions of this work (see Huang and Zhu (2018, 2019b)) have considered a dynamic game with
one-sided incomplete information where attackers disguise themselves as legitimate users. This work extends the
framework to a two-sided incomplete information structure where primitive systems can also disguise themselves as
sophisticated systems. The new framework enables us to jointly investigate deceptions adopted by both attackers and
defenders, and strategically design defensive deceptions to counter adversarial ones. We also develop new method-
ologies to address the challenge of the coupled belief update in a generalized setting without the previous assumption
of the beta-binomial conjugate pair. In the case study, we investigate heterogeneous actions and cyber stages such as
web phishing and privilege escalation, whose utilities are no longer negligible. Moreover, we leverage the Tennessee
Eastman process with new performance metrics and attack models to validate the efficacy of the proposed proactive
defense-in-depth strategies, the Bayesian learning, and the defensive deception.

1.2. Organization of the Paper


We summarize notations, variables, and acronyms in Table 1 for readers' convenience. We use the pronoun 'he' for
the user and ‘she’ for the defender throughout this paper. The rest of the paper is organized as follows. Section 2
introduces the multi-stage game with incomplete information and three equilibrium concepts are defined in Section 3.
To compute these equilibria, we construct constrained optimization problems and an iterative algorithm in Section 4.
A case study of Tennessee Eastman process under APTs is presented in Section 5 with results in Section 6. Section 7
concludes the paper.


Table 1
Summary of notations, variables, and acronyms.
General Notation Meaning
𝐴 ∶= 𝐵 𝐴 is defined as 𝐵
Pr Probability
𝑓 ∶𝐴↦𝐵 A function or a mapping 𝑓 from domain 𝐴 to codomain 𝐵
𝔼𝑎∼𝐴 [𝑓 (𝑎)] Expectation of 𝑓 (𝑎) over random variable 𝑎 whose distribution is 𝐴
ℝ Set of real numbers
|𝐴| The cardinality of set 𝐴
𝑎∼𝐴 Random variable 𝑎 follows probability distribution 𝐴
𝟏{𝑥=𝑦} Indicator function which equals one when 𝑥 = 𝑦, and zero otherwise
{𝑎1 , ⋯ , 𝑎𝑛 } Set with 𝑛 elements 𝑎1 , ⋯ , 𝑎𝑛
𝐵⧵𝐴 Set of elements in 𝐵 but not in 𝐴

Variable Meaning
𝑖, 𝑗 ∈ {1, 2} Index for players in the game: 𝑖, 𝑗 = 1 for the defender and 𝑖, 𝑗 = 2 for the user
Θ𝑖 Set of all possible types of player 𝑖 ∈ {1, 2}
Δ(Θ𝑖 ) Space of probability distributions over type set Θ𝑖 of player 𝑖 ∈ {1, 2}
𝜃𝑖 ∈ Θ𝑖 Type of player 𝑖 ∈ {1, 2}
𝜃1𝐻 (resp. 𝜃1𝐿 ) The defender is sophisticated (resp. primitive)
𝜃2𝑏 (resp. 𝜃2𝑔 ) The user is adversarial (resp. legitimate)
𝐾 Total number of stages
𝑘 ∈ {0, 1, ⋯ , 𝐾} Stage index
𝑘0 ∈ {0, 1, ⋯ , 𝐾} Index for the initial stage
𝐴𝑘𝑖 Set of all possible actions of player 𝑖 ∈ {1, 2} at stage 𝑘 ∈ {0, 1, ⋯ , 𝐾}
Δ(𝐴𝑘𝑖 ) Space of probability distributions over the action set 𝐴𝑘𝑖
𝑎𝑘𝑖 ∈ 𝐴𝑘𝑖 Action of player 𝑖 ∈ {1, 2} at stage 𝑘 ∈ {0, 1, ⋯ , 𝐾}
ℎ𝑘 , 𝐻 𝑘 Action history and the set of all possible action histories at stage 𝑘 ∈ {0, 1, ⋯ , 𝐾}
𝑥𝑘 , 𝑋 𝑘 State and the set of all possible states at stage 𝑘 ∈ {0, 1, ⋯ , 𝐾}
𝑓𝑘 State transition function at stage 𝑘, i.e., 𝑥𝑘+1 = 𝑓 𝑘 (𝑥𝑘 , 𝑎𝑘1 , 𝑎𝑘2 )
𝑙𝑖𝑘 , 𝐿𝑘𝑖 Available Information and set of all available information for player 𝑖 at stage 𝑘
𝜎𝑖𝑘 , Σ𝑘𝑖 Behavioral strategy and the set of all behavioral strategies for player 𝑖 at stage 𝑘
𝜎𝑖𝑘 (𝑎𝑘𝑖 |𝑙𝑖𝑘 ) Probability of player 𝑖 taking action 𝑎𝑘𝑖 at stage 𝑘 based on the available information 𝑙𝑖𝑘
𝜎𝑖𝑘0∶𝐾 Player 𝑖's behavioral strategies from stage 𝑘0 to 𝐾
𝜎𝑖∗,𝑘0∶𝐾 (𝜎𝑖∗,𝐾 ∶= 𝜎𝑖∗,𝐾∶𝐾 ) Player 𝑖's behavioral strategies from stage 𝑘0 to 𝐾 at the equilibrium
𝑏𝑘𝑖 ∶ 𝐿𝑘𝑖 ↦ Δ(Θ𝑗 ) Player 𝑖's belief on the other player 𝑗's type at stage 𝑘 based on the available information
𝑏𝑘𝑖 (𝜃𝑗 |𝑙𝑖𝑘 ) Probability of player 𝑗 being type 𝜃𝑗 when player 𝑖 observes information 𝑙𝑖𝑘 at stage 𝑘
𝐽̄𝑖𝑘 (𝑥𝑘 , 𝑎𝑘1 , 𝑎𝑘2 , 𝜃1 , 𝜃2 , 𝑤𝑘𝑖 ) Player 𝑖's stage utility received at stage 𝑘 when the state is 𝑥𝑘 , player 𝑖 takes action 𝑎𝑘𝑖 , player 𝑖's type is 𝜃𝑖 , and the noise is 𝑤𝑘𝑖
𝐽𝑖𝑘 (𝑥𝑘 , 𝑎𝑘1 , 𝑎𝑘2 , 𝜃1 , 𝜃2 ) Player 𝑖's expected stage utility received at stage 𝑘 with the input of 𝑥𝑘 , 𝑎𝑘1 , 𝑎𝑘2 , 𝜃1 , 𝜃2
𝑈𝑖𝑘0∶𝐾 (𝜎𝑖𝑘0∶𝐾 , 𝜎𝑗𝑘0∶𝐾 , 𝑥𝑘0 , 𝜃𝑖 ) Player 𝑖's expected cumulative utility received from stage 𝑘0 to 𝐾 when the initial state is 𝑥𝑘0 , his/her type is 𝜃𝑖 , and the multi-stage strategies of player 𝑖 are 𝜎𝑖𝑘0∶𝐾
𝑉𝑖𝑘 (𝑥𝑘 , 𝜃𝑖 ) Player 𝑖's value function at state 𝑥𝑘 when his/her type is 𝜃𝑖

Acronym Meaning
APT(s) Advanced persistent threat(s)
SBNE Static Bayesian Nash equilibrium
DBNE Dynamic Bayesian Nash equilibrium
PBNE Perfect Bayesian Nash equilibrium

2. Dynamic Game Modelling of APT Attacks


There are two players in the game: player 1 is the defender and player 2 is the user (cf. Table 1). The stealthy, persistent, and deceptive features of APTs result in incomplete information of the user's type to the defender. We use a finite set Θ2 to accommodate all possible types of the user. For example, we consider a binary type set for the case study in Sections 5 and 6, where the user's type 𝜃2 is either adversarial 𝜃2𝑏 or legitimate 𝜃2𝑔 .


Figure 2: A block diagram of applying the defense-in-depth approach against multi-stage APT attacks. The user, the defender, and the system states are denoted in red, blue, and black, respectively. The defender interacts with the user from stage 0 to stage 𝐾 in sequence, where the output state of stage 𝑘 − 1 becomes the input state of stage 𝑘. At each stage 𝑘, the user observes the defender's actions at previous stages, forms a belief on the defender's type, and takes an action. At the same time, the defender makes decisions based on the output of an imperfect detection system. The dotted line means that the observation is not in real time, i.e., both players can only observe the previous-stage actions of the other player.

5 and 6 where the user’s type 𝜃2 is either adversarial 𝜃2𝑏 or legitimate 𝜃2𝑔 . The APT attacker, i.e., the adversarial user,
disguises himself as the legitimate user, thus the defender does not know the type of the user. The set of the user’s type
can also be non-binary and incorporate different APT groups when their attack tools and targeted assets are different
(see FireEye (2017)).
The defender can also be classified into different levels of sophistication based on various factors such as her level of security awareness, the detection techniques she has adopted, and the completeness of her virus signature database. The
discrete type 𝜃1 distinguishes defenders of different sophistication levels and all the possible type values constitute the
defender’s type set Θ1 . For example, in our case study, the defender’s type 𝜃1 is either sophisticated 𝜃1𝐻 or primitive
𝜃1𝐿 . The defender can apply defensive deception techniques and keep her type private to the user. We assume that both
players’ type sets are commonly known. Each player knows his/her own type, yet not the other player’s type. Thus,
each player 𝑖 should treat the other player’s type as a random variable with an initial distribution 𝑏0𝑖 and update the
distribution to 𝑏𝑘𝑖 when obtaining new information at each stage 𝑘. We present the above belief update formally in
Section 2.3.

2.1. Multi-stage Transition


We formulate the interaction between the multi-stage APT attack and the cross-stage proactive defense into 𝐾
stages of sequential games with incomplete information, as shown in Fig. 2. At each stage 𝑘 ∈ {0, 1, ⋯ , 𝐾}, player
𝑖 ∈ {1, 2} takes an action 𝑎𝑘𝑖 ∈ 𝐴𝑘𝑖 from a finite and discrete set 𝐴𝑘𝑖 . An intrusion detection system generates alerts
based on the user’s actions. However, since legitimate users can also trigger these alerts, each alert itself does not reveal
the user’s type. For example, an APT attacker uses the Tor network connection for data exfiltration, yet a legitimate
user can also use it legally for the traffic confidentiality as shown in Milajerdi and Kharrazi (2015). Another example


is that code obfuscation can be either used legitimately to prevent reverse engineering or illegally to conceal malicious
JavaScript code from being recognized by signature-based detectors or human analysts as shown in Nissim et al. (2015).
We assume that the user can observe the defender’s stage-𝑘 action at stage 𝑘 + 1. The observation of the defender’s
action at a single stage also does not reveal the defender’s type.
In this paper, each player obtains a one-stage delayed observation of the other player's actions, i.e., at each stage $k$, the action history available to both players is $h^k = \{a_1^0, \cdots, a_1^{k-1}, a_2^0, \cdots, a_2^{k-1}\} \in H^k := \prod_{i=1}^{2} \prod_{\bar{k}=0}^{k-1} A_i^{\bar{k}}$. Given
history ℎ𝑘 at the current stage 𝑘, players at stage 𝑘 + 1 obtain an updated history ℎ𝑘+1 = ℎ𝑘 ∪ {𝑎𝑘1 , 𝑎𝑘2 } after the
observation of both players’ actions at stage 𝑘. At each stage 𝑘, we further define a state 𝑥𝑘 ∈ 𝑋 𝑘 which summarizes
information about both players’ actions in previous stages so that the initial state 𝑥0 ∈ 𝑋 0 and the history at stage 𝑘
uniquely determine 𝑥𝑘 through a known state transition function 𝑓 𝑘 , i.e., 𝑥𝑘+1 = 𝑓 𝑘 (𝑥𝑘 , 𝑎𝑘1 , 𝑎𝑘2 ), ∀𝑘 ∈ {0, 1, ⋯ , 𝐾 −1}.
States at different stages can have different meanings such as the reconnaissance outcome, the user’s location, the
privilege level, and the sensor status.
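For concreteness, a state transition function 𝑓 𝑘 can be encoded as an ordinary function of the state and the two actions. The Python sketch below gives one plausible stage-0 transition consistent with the case study of Sections 5.1 and 5.2; the function name f0 and the exact mapping are our own illustration, not the paper's specification.

```python
def f0(x0, a1, a2):
    """Toy stage-0 transition x^1 = f^0(x^0, a1^0, a2^0).

    Stage-1 states follow Section 5.2: 0 = quarantine area,
    1 = employee's computer, 2 = manager's computer.
    User actions follow Section 5.1: 0 = email employees,
    1 = email managers, 2 = email avatars.
    A richer transition would also use x0 and the defender's action a1.
    """
    if a2 == 2:       # contacting an avatar leads to the unfavorable state
        return 0
    return 1 if a2 == 0 else 2
```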

2.2. Behavioral Strategy


A defender should behave differently when interacting with adversarial users and legitimate ones. The defensive
measure should also vary for attackers who adopt different code families and tools. However, since the defender is
uncertain about the user’s type throughout the entire stages of games, she has to make judicious decisions at each stage
to balance usability versus security. The user’s action should also adapt to the type of the defender. For example, if
the defender is primitive, an attacker prefers to take aggressive adversarial actions to achieve a quicker and low-cost
compromise. However, if the defender is sophisticated and can detect the malware with better accuracy, an attacker
has to take conservative actions to remain stealthy. Since the proactive defense actions across the entire stages can
affect legitimate users, they also need to be designed to avoid collateral damage.
Thus, the decision-making problem of the defender or the user boils down to the determination of a behavioral
strategy 𝜎𝑖𝑘 ∈ Σ𝑘𝑖 ∶ 𝐿𝑘𝑖 ↦ Δ(𝐴𝑘𝑖 ), i.e., player 𝑖 at each stage 𝑘 needs to decide which action to take or take an action
with what probability based on the information 𝑙𝑖𝑘 ∈ 𝐿𝑘𝑖 available to him/her at stage 𝑘. We present two different
information structures in Sections 2.3.1 and 2.3.2. The strategy is called ‘behavioral' as the strategy depends on the
information available at the time the players make their decisions. In this work, players are allowed to take mixed
strategies, thus the co-domain of the strategy function 𝜎𝑖𝑘 is Δ(𝐴𝑘𝑖 ), a probability distribution over the action space 𝐴𝑘𝑖 .
With a slight abuse of notation, we denote 𝜎𝑖𝑘 (𝑎𝑘𝑖 |𝑙𝑖𝑘 ) as the probability of player 𝑖 taking action 𝑎𝑘𝑖 ∈ 𝐴𝑘𝑖 given the
available information 𝑙𝑖𝑘 ∈ 𝐿𝑘𝑖 . The actual action of player 𝑖 taken at stage 𝑘, i.e., 𝑎𝑘𝑖 , is a realization of the behavioral
strategy 𝜎𝑖𝑘 . Note that the values of the other player’s type 𝜃𝑗 and action 𝑎𝑘𝑗 are not observable for player 𝑖 at stage
𝑘, thus do not affect player 𝑖’s behavioral strategy 𝜎𝑖𝑘 , i.e., Pr(𝑎𝑘𝑖 |𝑎𝑘𝑗 , 𝜃𝑗 , 𝑙𝑖𝑘 ) = 𝜎𝑖𝑘 (𝑎𝑘𝑖 |𝑙𝑖𝑘 ). Therefore, 𝜎1𝑘 and 𝜎2𝑘 are
conditionally independent, i.e., Pr(𝑎𝑘𝑖 , 𝑎𝑘𝑗 |𝑙𝑖𝑘 , 𝑙𝑗𝑘 ) = 𝜎𝑖𝑘 (𝑎𝑘𝑖 |𝑙𝑖𝑘 )𝜎𝑗𝑘 (𝑎𝑘𝑗 |𝑙𝑗𝑘 ).
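Since a behavioral strategy is just a mapping from the available information to a distribution over actions, it can be represented directly as a lookup table. The following minimal Python sketch (all names and probabilities are our own toy assumptions) stores 𝜎𝑖𝑘 and draws a realized action:

```python
import random

# A behavioral strategy sigma_i^k: maps available information l_i^k
# (here a hashable tuple) to a probability distribution over actions.
sigma_k = {
    ("x=1", "theta_H"): {0: 0.7, 1: 0.3},   # toy numbers
    ("x=1", "theta_L"): {0: 0.9, 1: 0.1},
}

def sample_action(sigma_k, info):
    """Draw a realized action a_i^k ~ sigma_i^k(. | l_i^k)."""
    dist = sigma_k[info]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

a = sample_action(sigma_k, ("x=1", "theta_H"))
```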

2.3. Belief and Bayesian Update


To quantify the uncertainty of the other player’s type throughout the entire stages, each player 𝑖 forms a belief
𝑏𝑘𝑖 ∶ 𝐿𝑘𝑖 ↦ Δ(Θ𝑗 ), 𝑗 ≠ 𝑖. Likewise, 𝑏𝑘𝑖 (𝜃𝑗 |𝑙𝑖𝑘 ) means that given information 𝑙𝑖𝑘 ∈ 𝐿𝑘𝑖 at stage 𝑘, player 𝑖 forms a belief
that the other player 𝑗 is of type 𝜃𝑗 ∈ Θ𝑗 with probability 𝑏𝑘𝑖 (𝜃𝑗 |𝑙𝑖𝑘 ). At the initial stage 𝑘 = 0, the only information
available to player 𝑖 is his/her own type, i.e., 𝑙𝑖0 = 𝜃𝑖 . We assume that player 𝑖 has a prior belief distribution 𝑏0𝑖 based
on the past experiences with the other player. If no previous experiences are available to player 𝑖, player 𝑖 can take the
uniform distribution as an unbiased prior belief. As each player 𝑖 obtains new information when arriving at the next
stage, his or her belief can be updated using the Bayesian rule. We present the Bayesian update under two different
information structures 𝐿𝑘𝑖 at stage 0 < 𝑘 ≤ 𝐾 in the following two subsections.

2.3.1. Timely Observations


The most straightforward information structure is 𝐿𝑘𝑖 = 𝐻 𝑘 × Θ𝑖 , i.e., the information available to player 𝑖 at stage
𝑘 is the action history ℎ𝑘 and player 𝑖’s own type 𝜃𝑖 , which leads to the belief update in (1).

$$b_i^{k+1}(\theta_j \,|\, h^k \cup \{a_i^k, a_j^k\}, \theta_i) = \frac{\sigma_i^k(a_i^k|h^k,\theta_i)\, \sigma_j^k(a_j^k|h^k,\theta_j)\, b_i^k(\theta_j|h^k,\theta_i)}{\sum_{\bar{\theta}_j \in \Theta_j} \sigma_i^k(a_i^k|h^k,\theta_i)\, \sigma_j^k(a_j^k|h^k,\bar{\theta}_j)\, b_i^k(\bar{\theta}_j|h^k,\theta_i)}, \quad i,j \in \{1,2\},\ j \neq i. \tag{1}$$


Here, player $i$ updates the belief $b_i^k$ based on the observation of the actions $a_i^k, a_j^k$. When the denominator is 0, the history $h^{k+1}$ is not reachable from $h^k$, and the Bayesian update does not apply. In this case, we let $b_i^{k+1}(\theta_j|h^k \cup \{a_i^k, a_j^k\}, \theta_i) := b_i^0(\theta_j|\theta_i)$.
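A minimal Python sketch of the update (1) is given below (function and argument names are ours). Because player $i$'s own strategy factor $\sigma_i^k(a_i^k|h^k,\theta_i)$ appears in both the numerator and the denominator of (1), it cancels, so the code only needs the opponent's strategy:

```python
def bayes_update(b_k, sigma_j_k, a_j, prior):
    """Belief update (1): b_i^{k+1}(theta_j | h^k u {a_i^k, a_j^k}, theta_i).

    b_k:       dict theta_j -> b_i^k(theta_j | h^k, theta_i)
    sigma_j_k: dict theta_j -> (dict a_j -> sigma_j^k(a_j | h^k, theta_j))
    prior:     dict theta_j -> b_i^0(theta_j | theta_i), used when the
               observed history has zero probability under the strategies
    """
    weights = {tj: sigma_j_k[tj][a_j] * b_k[tj] for tj in b_k}
    denom = sum(weights.values())
    if denom == 0:          # h^{k+1} unreachable: reset to the prior belief
        return dict(prior)
    return {tj: w / denom for tj, w in weights.items()}
```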

2.3.2. Markov Belief


If the information available to player 𝑖 at stage 𝑘 is the state value 𝑥𝑘 and player 𝑖’s own type 𝜃𝑖 , then the information
set is taken to be 𝐿𝑘𝑖 = 𝑋 𝑘 × Θ𝑖 . With the Markov property that Pr(𝑥𝑘+1 |𝜃𝑗 , 𝑥𝑘 , ⋯ , 𝑥1 , 𝑥0 , 𝜃𝑖 ) = Pr(𝑥𝑘+1 |𝜃𝑗 , 𝑥𝑘 , 𝜃𝑖 ),
the Bayesian update between two consecutive states is

$$b_i^{k+1}(\theta_j|x^{k+1},\theta_i) = \frac{\Pr(x^{k+1}|\theta_j,x^k,\theta_i)\, b_i^k(\theta_j|x^k,\theta_i)}{\sum_{\bar{\theta}_j \in \Theta_j} \Pr(x^{k+1}|\bar{\theta}_j,x^k,\theta_i)\, b_i^k(\bar{\theta}_j|x^k,\theta_i)}, \quad i,j \in \{1,2\},\ j \neq i. \tag{2}$$

With the conditional independence of 𝜎1𝑘 and 𝜎2𝑘 ,



$$\Pr(x^{k+1}|\theta_j,x^k,\theta_i) = \sum_{\{a_1^k, a_2^k\} \in \bar{A}^k} \sigma_1^k(a_1^k|x^k,\theta_1)\, \sigma_2^k(a_2^k|x^k,\theta_2), \tag{3}$$

where 𝐴̄ 𝑘 ∶= {𝑎𝑘1 ∈ 𝐴𝑘1 , 𝑎𝑘2 ∈ 𝐴𝑘2 |𝑥𝑘+1 = 𝑓 𝑘 (𝑥𝑘 , 𝑎𝑘1 , 𝑎𝑘2 )} contains all the action pairs that change the system state
from 𝑥𝑘 to 𝑥𝑘+1 . Equation (3) shows that the Bayesian update in (2) can be obtained from (1) by clustering all the
action pairs in set 𝐴̄ 𝑘 . Thus, the Markov belief update (2) can also be regarded as an approximation of (1) using action
aggregations. Unlike the history set 𝐻 𝑘 , the dimension of the state set, |𝑋 𝑘 |, does not grow with the number of stages.
Hence, the Markov approximation significantly reduces the memory and computational complexity. The following
sections adopt the Markov belief update.
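The sketch below implements (2) via the aggregation (3), enumerating the action pairs in 𝐴̄ 𝑘 by checking the transition 𝑓 𝑘 directly. It is written for player 1 updating her belief about the user's type; the user's update is symmetric (all names are our own):

```python
def markov_belief_update(b_k, x_k, x_next, sigma1_k, sigma2_k, f_k,
                         A1, A2, theta_1, prior):
    """Update (2) for player 1's belief about the user's type theta_2.

    sigma1_k[theta_1][a1] = sigma_1^k(a1 | x^k, theta_1); likewise sigma2_k.
    The transition probability (3) sums sigma_1^k * sigma_2^k over the
    action pairs in \bar{A}^k that move the state from x^k to x^{k+1}.
    """
    def trans_prob(theta_2):
        return sum(sigma1_k[theta_1][a1] * sigma2_k[theta_2][a2]
                   for a1 in A1 for a2 in A2
                   if f_k(x_k, a1, a2) == x_next)

    weights = {t2: trans_prob(t2) * p for t2, p in b_k.items()}
    denom = sum(weights.values())
    if denom == 0:                      # unreachable state: reset to prior
        return dict(prior)
    return {t2: w / denom for t2, w in weights.items()}
```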

2.4. Stage and Cumulative Utility


The player’s utility can vary under the same action taken by different types of users or defenders. For example,
the remote access from a legitimate teleworker brings a reward to the defender while the one from an adversarial user
inflicts a loss. Therefore, at each stage 𝑘, player 𝑖’s stage utility 𝐽̄𝑖𝑘 ∶ 𝑋 𝑘 × 𝐴𝑘1 × 𝐴𝑘2 × Θ1 × Θ2 × ℝ ↦ ℝ can depend
on both players’ types and actions, the current state 𝑥𝑘 ∈ 𝑋 𝑘 , and an external noise 𝑤𝑘𝑖 ∈ ℝ with a known probability
density function 𝜛𝑖𝑘 . The noise term models unknown or uncontrolled factors that can affect the value of the stage
utility. The existence of the external noise makes it impossible for player 𝑖, after reaching stage 𝑘 + 1, to infer the value
of the other player’s type 𝜃𝑗 based on the knowledge of the input parameters 𝑥𝑘 , 𝑎𝑘1 , 𝑎𝑘2 , 𝜃𝑖 , together with the output of
the utility function 𝐽̄𝑖𝑘 at stage 𝑘.
We denote the expected stage utility as $J_i^k(x^k, a_1^k, a_2^k, \theta_1, \theta_2) := \mathbb{E}_{w_i^k \sim \varpi_i^k}[\bar{J}_i^k(x^k, a_1^k, a_2^k, \theta_1, \theta_2, w_i^k)]$, $\forall x^k, a_1^k, a_2^k, \theta_1, \theta_2$. Given the type $\theta_i \in \Theta_i$, the initial state $x^{k_0} \in X^{k_0}$, and both players' strategies $\sigma_i^{k_0:K} := [\sigma_i^k(a_i^k|x^k, \theta_i)]_{k=k_0,\cdots,K} \in \prod_{k=k_0}^{K} \Sigma_i^k$ from stage $k_0$ to $K$, we can determine the expected cumulative utility $U_i^{k_0:K}$ for player $i$, i.e.,
$$\begin{aligned} U_i^{k_0:K}(\sigma_i^{k_0:K}, \sigma_j^{k_0:K}, x^{k_0}, \theta_i) &:= \sum_{k=k_0}^{K} \mathbb{E}_{\theta_j \sim b_i^k,\, a_i^k \sim \sigma_i^k,\, a_j^k \sim \sigma_j^k}[J_i^k(x^k, a_1^k, a_2^k, \theta_1, \theta_2)] \\ &= \sum_{k=k_0}^{K} \sum_{\theta_j \in \Theta_j} b_i^k(\theta_j|x^k, \theta_i) \sum_{a_i^k \in A_i^k} \sigma_i^k(a_i^k|x^k, \theta_i) \sum_{a_j^k \in A_j^k} \sigma_j^k(a_j^k|x^k, \theta_j)\, J_i^k(x^k, a_1^k, a_2^k, \theta_1, \theta_2), \quad j \neq i. \end{aligned} \tag{4}$$
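For concreteness, the inner expectation in (4) at a single stage unrolls into the triple sum sketched below (written from player 1's viewpoint; the helper names are ours, and J_k is assumed to take its arguments in the order used in (4)):

```python
def expected_stage_utility(J_k, b_k, sigma1_k, sigma2_k, x_k, theta_1):
    """E_{theta_2 ~ b_1^k, a_1 ~ sigma_1^k, a_2 ~ sigma_2^k}[J_1^k(...)],
    i.e., one stage-k summand of the cumulative utility (4) for player 1."""
    total = 0.0
    for theta_2, p_type in b_k.items():
        for a1, p1 in sigma1_k[theta_1].items():
            for a2, p2 in sigma2_k[theta_2].items():
                total += p_type * p1 * p2 * J_k(x_k, a1, a2, theta_1, theta_2)
    return total
```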

3. PBNE and Dynamic Programming


The user and the defender use the Bayesian update to reduce their uncertainties on the other player’s type. Since
their actions affect the belief update, both players at each stage should optimize their expected cumulative utilities
concerning the updated beliefs at the future stages, which leads to the Perfect Bayesian Nash Equilibrium (PBNE) in
Definition 1.


Definition 1. Consider the two-person $K$-stage game with double-sided incomplete information (i.e., each player's type is not known to the other player), a sequence of beliefs $b_i^k, \forall k \in \{0,\cdots,K\}$, an expected cumulative utility $U_i^{0:K}$ in (4), and a given scalar $\varepsilon \geq 0$. A sequence of strategies $\sigma_i^{*,0:K} \in \prod_{k=0}^{K} \Sigma_i^k$ is called an $\varepsilon$-dynamic Bayesian Nash equilibrium for player $i$ if condition (C2) is satisfied. If condition (C1) is also satisfied, $\sigma_i^{*,0:K}$ is further called an $\varepsilon$-perfect Bayesian Nash equilibrium.
(C1) Belief consistency: under strategy pair (𝜎1∗,0∶𝐾 , 𝜎2∗,0∶𝐾 ), each player’s belief 𝑏𝑘𝑖 at each stage 𝑘 = 0, ⋯ , 𝐾
satisfies (2).
(C2) Sequential rationality: for every given initial state $x^{k_0} \in X^{k_0}$ at every initial stage $k_0 \in \{0,\cdots,K\}$,
$$U_1^{k_0:K}(\sigma_1^{*,k_0:K}, \sigma_2^{*,k_0:K}, x^{k_0}, \theta_1) + \varepsilon \geq U_1^{k_0:K}(\sigma_1^{k_0:K}, \sigma_2^{*,k_0:K}, x^{k_0}, \theta_1), \quad \forall \sigma_1^{k_0:K} \in \prod_{k=k_0}^{K} \Sigma_1^k; \\ U_2^{k_0:K}(\sigma_1^{*,k_0:K}, \sigma_2^{*,k_0:K}, x^{k_0}, \theta_2) + \varepsilon \geq U_2^{k_0:K}(\sigma_1^{*,k_0:K}, \sigma_2^{k_0:K}, x^{k_0}, \theta_2), \quad \forall \sigma_2^{k_0:K} \in \prod_{k=k_0}^{K} \Sigma_2^k. \tag{5}$$

When 𝜀 = 0, the two 𝜀-equilibria are called Dynamic Bayesian Nash Equilibrium (DBNE) and Perfect Bayesian Nash
Equilibrium (PBNE), respectively.

The belief consistency emphasizes that when strategic players make long-term decisions, they have to consider
the impact of their actions on their opponent’s beliefs at future stages. The PBNE is a refinement of the DBNE
with the additional requirement of the belief consistency property. When the horizon 𝐾 = 0, the multi-stage game
of incomplete information defined in Section 2 degenerates to a one-stage (static) Bayesian game with the one-stage
belief pair $(b_1^K, b_2^K)$, and the solution concept of the DBNE/PBNE degenerates to the Static Bayesian Nash Equilibrium
(SBNE) in Definition 2.
The sequential rationality property in (5) guarantees that unilateral deviations from the equilibrium at any state do not benefit the deviating player. Thus, the equilibrium strategy can be a reasonable prediction of both players' multi-stage behaviors. DBNE strategies are strongly time consistent because (5) holds for any possible initial state, even for states that are not on the equilibrium path, i.e., states that would not be visited under DBNE strategies. Strong time consistency makes the DBNE adapt to unexpected changes. Solutions obtained by dynamic programming naturally satisfy strong time consistency. Hence, in the following, we introduce algorithms
based on dynamic programming techniques.
Define the value function $V_i^{k_0}(x^{k_0}, \theta_i) := U_i^{k_0:K}(\sigma_1^{*,k_0:K}, \sigma_2^{*,k_0:K}, x^{k_0}, \theta_i)$ as the utility-to-go from any initial stage $k_0 \in \{0,\cdots,K\}$ under the DBNE strategy pair $(\sigma_1^{*,k_0:K}, \sigma_2^{*,k_0:K})$. Then, at the final stage $K$, the value function for player $i \in \{1,2\}$ with type $\theta_i$ at state $x^K$ is
$$V_i^K(x^K, \theta_i) = \sup_{\sigma_i^K \in \Sigma_i^K} \mathbb{E}_{\theta_j \sim b_i^K,\, a_i^K \sim \sigma_i^K,\, a_j^K \sim \sigma_j^{*,K}}[J_i^K(x^K, a_1^K, a_2^K, \theta_1, \theta_2)]. \tag{6}$$

For any feasible sequence of belief pairs $(b_1^k, b_2^k)$, $k = 0,\cdots,K-1$, we have the following recursive system of equations for player $i$ to find the equilibrium strategy pairs $(\sigma_1^{*,k}, \sigma_2^{*,k})$ backwardly from stage $K-1$ to the initial stage 0, i.e., $\forall k \in \{0,\cdots,K-1\}$, $\forall i,j \in \{1,2\}$, $j \neq i$,
$$V_i^k(x^k, \theta_i) = \sup_{\sigma_i^k \in \Sigma_i^k} \mathbb{E}_{\theta_j \sim b_i^k,\, a_i^k \sim \sigma_i^k,\, a_j^k \sim \sigma_j^{*,k}}[V_i^{k+1}(f^k(x^k, a_1^k, a_2^k), \theta_i) + J_i^k(x^k, a_1^k, a_2^k, \theta_1, \theta_2)]. \tag{7}$$

If we assume a virtual termination value $V_i^{K+1}(f^K(x^K, a_1^K, a_2^K), \theta_i) \equiv 0$, we can obtain (6) by letting stage $k = K$ in (7). The second term in (7) represents the immediate stage utility, and the first term represents the expected utility under the future state $x^{k+1} = f^k(x^k, a_1^k, a_2^k)$, $k \in \{0,\cdots,K-1\}$. Since $a_i^k$ affects both terms, players should adopt a long-term perspective and avoid myopic behaviors in order to balance the immediate utility against the expected future utility.
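To illustrate how (6) and (7) are evaluated: once the opponent's strategy 𝜎𝑗∗,𝑘 and the belief 𝑏𝑘𝑖 are fixed, the objective is linear in 𝜎𝑖𝑘 , so the supremum over behavioral strategies is attained at a pure action and can be computed by enumeration. The sketch below (our own helper names) shows one backward step; computing the equilibrium strategy pair itself requires the programs of Section 4:

```python
def value_update(V_next, J_k, f_k, b_k, sigma_j_star, A_i, x_k, theta_i):
    """One backward step of (7) for player i at state x^k against a fixed
    opponent strategy sigma_j_star[theta_j][a_j].  Passing the zero
    function as V_next recovers the terminal condition (6)."""
    def q(a_i):
        # expected immediate utility plus utility-to-go for pure action a_i
        return sum(b_k[tj] * pj * (J_k(x_k, a_i, a_j, theta_i, tj)
                                   + V_next(f_k(x_k, a_i, a_j), theta_i))
                   for tj, dist in sigma_j_star.items()
                   for a_j, pj in dist.items())
    return max(q(a_i) for a_i in A_i)
```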


4. Computational Algorithms
In Section 4.1, we formulate a constrained optimization problem to compute the SBNE and $V_i^K$ for the one-stage game. In Section 4.2, we use the proposed optimization problem as a building block to compute the DBNE and $V_i^k$, $\forall k \in \{0,\cdots,K-1\}$.
Finally, we propose an iterative algorithm to solve for the PBNE. Efficient algorithms to compute the PBNE lay a solid
foundation to quantify the risk of cyber-physical attacks and guide the design of proactive defense-in-depth strategies.

4.1. One-Stage Bayesian Game and SBNE


Since both players’ actions at the final stage 𝑘 = 𝐾 only affect the immediate utility 𝐽𝑖𝐾 and there is no future state
transition, we can treat the final-stage game at each state 𝑥𝐾 ∈ 𝑋 𝐾 as an equivalent one-stage Bayesian game with the
belief 𝑏𝐾
𝑖 and obtain the SBNE.

Definition 2. A pair of mixed strategies $(\sigma_1^{*,K} \in \Sigma_1^K, \sigma_2^{*,K} \in \Sigma_2^K)$ is said to constitute a Static Bayesian Nash Equilibrium (SBNE) under the given belief pair $(b_1^K, b_2^K)$ and the state $x^K \in X^K$, if $\forall \theta_1 \in \Theta_1, \theta_2 \in \Theta_2$,
$$\mathbb{E}_{\theta_2 \sim b_1^K,\, a_1^K \sim \sigma_1^{*,K},\, a_2^K \sim \sigma_2^{*,K}}[J_1^K(x^K, a_1^K, a_2^K, \theta_1, \theta_2)] \geq \mathbb{E}_{\theta_2 \sim b_1^K,\, a_1^K \sim \sigma_1^{K},\, a_2^K \sim \sigma_2^{*,K}}[J_1^K(x^K, a_1^K, a_2^K, \theta_1, \theta_2)], \quad \forall \sigma_1^K \in \Sigma_1^K; \\ \mathbb{E}_{\theta_1 \sim b_2^K,\, a_1^K \sim \sigma_1^{*,K},\, a_2^K \sim \sigma_2^{*,K}}[J_2^K(x^K, a_1^K, a_2^K, \theta_1, \theta_2)] \geq \mathbb{E}_{\theta_1 \sim b_2^K,\, a_1^K \sim \sigma_1^{*,K},\, a_2^K \sim \sigma_2^{K}}[J_2^K(x^K, a_1^K, a_2^K, \theta_1, \theta_2)], \quad \forall \sigma_2^K \in \Sigma_2^K. \tag{8}$$

In Theorem 1, we propose a constrained optimization program $C^K$ to compute the SBNE. We suppress the superscript $K$ without ambiguity in one-stage games.

Theorem 1. A strategy pair $(\sigma_1^* \in \Sigma_1, \sigma_2^* \in \Sigma_2)$ constitutes a SBNE of the one-stage bi-matrix Bayesian game $(J_1, J_2)$ under private type $\theta_i \in \Theta_i, \forall i \in \{1,2\}$, belief $b_i, \forall i \in \{1,2\}$, and a given state $x$, if and only if the strategy pair is a solution to $C^K$:
$$\begin{aligned}[] [C^K]:\ \max_{\sigma_1, \sigma_2, s_1, s_2}\ & \sum_{\theta_1 \in \Theta_1} \alpha_1(\theta_1) s_1(x, \theta_1) + \sum_{\theta_2 \in \Theta_2} \alpha_2(\theta_2) s_2(x, \theta_2) \\ & + \sum_{\theta_1 \in \Theta_1} \alpha_1(\theta_1)\, \mathbb{E}_{\theta_2 \sim b_1, a_1 \sim \sigma_1, a_2 \sim \sigma_2}[J_1(x, a_1, a_2, \theta_1, \theta_2)] \\ & + \sum_{\theta_2 \in \Theta_2} \alpha_2(\theta_2)\, \mathbb{E}_{\theta_1 \sim b_2, a_1 \sim \sigma_1, a_2 \sim \sigma_2}[J_2(x, a_1, a_2, \theta_1, \theta_2)] \\ \text{s.t. }\ & (a)\ \mathbb{E}_{\theta_1 \sim b_2, a_1 \sim \sigma_1}[J_2(x, a_1, a_2, \theta_1, \theta_2)] \leq -s_2(x, \theta_2), \quad \forall \theta_2, \forall a_2, \\ & (b)\ \sum_{a_1 \in A_1} \sigma_1(a_1|x, \theta_1) = 1, \ \sigma_1(a_1|x, \theta_1) \geq 0, \quad \forall \theta_1, \\ & (c)\ \mathbb{E}_{\theta_2 \sim b_1, a_2 \sim \sigma_2}[J_1(x, a_1, a_2, \theta_1, \theta_2)] \leq -s_1(x, \theta_1), \quad \forall \theta_1, \forall a_1, \\ & (d)\ \sum_{a_2 \in A_2} \sigma_2(a_2|x, \theta_2) = 1, \ \sigma_2(a_2|x, \theta_2) \geq 0, \quad \forall \theta_2. \end{aligned}$$
The dimensions of the decision variables $\sigma_1(a_1|x, \theta_1), \forall \theta_1 \in \Theta_1$, and $\sigma_2(a_2|x, \theta_2), \forall \theta_2 \in \Theta_2$, are $|A_1| \times |\Theta_1|$ and $|A_2| \times |\Theta_2|$, respectively. Besides, $s_1(x, \theta_1), \forall \theta_1$, and $s_2(x, \theta_2), \forall \theta_2$, are scalar decision variables for each given $\theta_i$, $i \in \{1,2\}$. The non-decision variables $\alpha_1(\theta_1), \forall \theta_1$, and $\alpha_2(\theta_2), \forall \theta_2$, can be any strictly positive and finite numbers. The solution to $C^K$ exists and is achieved at the equality of constraints $(a), (c)$, i.e., $s_2^*(x, \theta_2) = -V_2(x, \theta_2)$, $s_1^*(x, \theta_1) = -V_1(x, \theta_1)$.

PROOF. The finiteness and discreteness of the action and type spaces guarantee the existence of the SBNE in mixed strategies as shown in Shoham and Leyton-Brown (2008), which further guarantees that program $C^K$ has solutions. To show the equivalence between the solutions to $C^K$ and the SBNE, we first show that every SBNE is a solution of $C^K$. If $(\sigma_1^* \in \Sigma_1, \sigma_2^* \in \Sigma_2)$ is a SBNE pair, then the quadruple $\sigma_1^*(\theta_1), \sigma_2^*(\theta_2), s_2^*(x, \theta_2) = -V_2(x, \theta_2), s_1^*(x, \theta_1) = -V_1(x, \theta_1)$, $\forall \theta_i \in \Theta_i, \forall i \in \{1,2\}$, is feasible because it satisfies constraints $(a), (b), (c), (d)$. Constraints $(a)$ and $(c)$ imply a non-positive objective function of $C^K$. Since the value of the objective function achieved under this quadruple is 0, this quadruple is also optimal. Second, we show that the solution $\sigma_1^*(\theta_1), \sigma_2^*(\theta_2), s_2^*(x, \theta_2), s_1^*(x, \theta_1)$ of $C^K$ constitutes a SBNE. The solution of $C^K$ satisfies all the constraints, i.e.,
$$\mathbb{E}_{\theta_1 \sim b_2,\, a_1 \sim \sigma_1^*,\, a_2 \sim \sigma_2}[J_2(x, a_1, a_2, \theta_1, \theta_2)] \leq -s_2^*(x, \theta_2), \quad \forall \theta_2, \forall \sigma_2 \in \Sigma_2, \\ \mathbb{E}_{\theta_2 \sim b_1,\, a_1 \sim \sigma_1,\, a_2 \sim \sigma_2^*}[J_1(x, a_1, a_2, \theta_1, \theta_2)] \leq -s_1^*(x, \theta_1), \quad \forall \theta_1, \forall \sigma_1 \in \Sigma_1. \tag{9}$$
In particular, if we pick $\sigma_i(\theta_i) = \sigma_i^*(\theta_i)$, $\forall \theta_i, \forall i \in \{1,2\}$, and combine the fact that the optimal value is achieved at 0, the inequalities become equalities and (9) becomes (8), which shows that $(\sigma_1^* \in \Sigma_1, \sigma_2^* \in \Sigma_2)$ is a SBNE. □

Theorem 1 focuses on the double-sided Bayesian game where each player $i$ has a private type $\theta_i \in \Theta_i$. To accommodate the one-sided Bayesian game where player $i$'s type $\theta_i \in \Theta_i$ is known by both players and player $j$'s type remains unknown to player $i$, we can modify program $C^K$ by letting $\alpha_i(\theta_i) > 0$ and $\alpha_i(\tilde{\theta}_i) = 0, \forall \tilde{\theta}_i \in \Theta_i \setminus \{\theta_i\}$.
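As an illustration, the bilinear program $C^K$ can be handed to an off-the-shelf nonlinear solver. The sketch below uses scipy.optimize.minimize with SLSQP, which is our own choice rather than anything prescribed by the paper; being a local method, it should be restarted from several initial points, and a returned objective value of 0 certifies a SBNE by Theorem 1:

```python
import numpy as np
from scipy.optimize import minimize

def solve_CK(J1, J2, b1, b2, alpha=1.0):
    """Sketch of program C^K at a fixed state x (solver choice is ours).

    J1, J2 : arrays of shape (n1, n2, m1, m2) = J_i(x, a1, a2, t1, t2),
             with n_i = |A_i| actions and m_i = |Theta_i| types.
    b1     : shape (m1, m2), b1[t1, t2] = b_1(theta_2 = t2 | theta_1 = t1).
    b2     : shape (m2, m1), b2[t2, t1] = b_2(theta_1 = t1 | theta_2 = t2).
    """
    n1, n2, m1, m2 = J1.shape

    def unpack(z):  # z = [sigma1 (m1*n1), sigma2 (m2*n2), s1 (m1), s2 (m2)]
        o = 0
        sg1 = z[o:o + m1 * n1].reshape(m1, n1); o += m1 * n1
        sg2 = z[o:o + m2 * n2].reshape(m2, n2); o += m2 * n2
        return sg1, sg2, z[o:o + m1], z[o + m1:]

    def neg_obj(z):  # negative of the objective of C^K (we minimize)
        sg1, sg2, s1, s2 = unpack(z)
        val = alpha * (s1.sum() + s2.sum())
        for t1 in range(m1):
            for t2 in range(m2):
                val += alpha * b1[t1, t2] * sg1[t1] @ J1[:, :, t1, t2] @ sg2[t2]
                val += alpha * b2[t2, t1] * sg1[t1] @ J2[:, :, t1, t2] @ sg2[t2]
        return -val

    cons = []
    for t1 in range(m1):   # simplex constraints (b)
        cons.append({'type': 'eq', 'fun': lambda z, t1=t1: unpack(z)[0][t1].sum() - 1.0})
    for t2 in range(m2):   # simplex constraints (d)
        cons.append({'type': 'eq', 'fun': lambda z, t2=t2: unpack(z)[1][t2].sum() - 1.0})
    for t2 in range(m2):   # constraints (a): E[J2] <= -s2 for every pure a2
        for a2 in range(n2):
            def con_a(z, t2=t2, a2=a2):
                sg1, _, _, s2 = unpack(z)
                e = sum(b2[t2, t1] * sg1[t1] @ J2[:, a2, t1, t2] for t1 in range(m1))
                return -s2[t2] - e          # feasibility means >= 0
            cons.append({'type': 'ineq', 'fun': con_a})
    for t1 in range(m1):   # constraints (c): E[J1] <= -s1 for every pure a1
        for a1 in range(n1):
            def con_c(z, t1=t1, a1=a1):
                _, sg2, s1, _ = unpack(z)
                e = sum(b1[t1, t2] * J1[a1, :, t1, t2] @ sg2[t2] for t2 in range(m2))
                return -s1[t1] - e
            cons.append({'type': 'ineq', 'fun': con_c})

    z0 = np.concatenate([np.full(m1 * n1, 1 / n1), np.full(m2 * n2, 1 / n2),
                         np.zeros(m1 + m2)])
    bounds = [(0, 1)] * (m1 * n1 + m2 * n2) + [(None, None)] * (m1 + m2)
    res = minimize(neg_obj, z0, method='SLSQP', bounds=bounds, constraints=cons)
    return unpack(res.x), -res.fun  # strategies, slacks, objective value
```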

4.2. Multi-stage Bayesian Game and PBNE


From (7), we can see that at stages $k < K$, each player optimizes the sum of the immediate utility $J_i^k$ and the utility-to-go $V_i^{k+1}$. Thus, we can replace the original stage utility $J_i^K$ in program $C^K$ with $J_i^k + V_i^{k+1}$ in program $C^k$ to compute the DBNE in a multi-stage Bayesian game.
Theorem 2. Given a sequence of beliefs $b_i^k$ for each player $i \in \{1,2\}$ at each stage $k \in \{0,1,\cdots,K-1\}$, a strategy pair $(\sigma_1^{*,0:K-1}, \sigma_2^{*,0:K-1})$ constitutes a DBNE of the $K$-stage Bayesian game under double-sided incomplete information with the expected cumulative utility $U_i^{0:K}$ in (4) if, and only if, $\sigma_1^{*,k}, \sigma_2^{*,k}, s_1^{*,k}(x^k, \theta_1), s_2^{*,k}(x^k, \theta_2)$ are the optimal solutions to the following constrained optimization problem $C^k$ for each $k \in \{0,1,\cdots,K-1\}$:
$$\begin{aligned}[] [C^k]:\ \max_{\sigma_1^k, \sigma_2^k, s_1^k, s_2^k}\ & \sum_{i=1}^{2} \sum_{\theta_i \in \Theta_i} \alpha_i(\theta_i) \Big\{ s_i^k(x^k, \theta_i) + \sum_{\theta_j \in \Theta_j} b_i^k(\theta_j|x^k, \theta_i) \sum_{a_1^k \in A_1^k} \sigma_1^k(a_1^k|x^k, \theta_1) \sum_{a_2^k \in A_2^k} \sigma_2^k(a_2^k|x^k, \theta_2) \\ & \qquad \cdot \big[ J_i^k(x^k, a_1^k, a_2^k, \theta_1, \theta_2) + V_i^{k+1}(f^k(x^k, a_1^k, a_2^k), \theta_i) \big] \Big\} \\ \text{s.t. }\ & (a)\ \sum_{\theta_1 \in \Theta_1} b_2^k(\theta_1|x^k, \theta_2) \sum_{a_1^k \in A_1^k} \sigma_1^k(a_1^k|x^k, \theta_1) \cdot \big[ J_2^k(x^k, a_1^k, a_2^k, \theta_1, \theta_2) + V_2^{k+1}(f^k(x^k, a_1^k, a_2^k), \theta_2) \big] \\ & \qquad \leq -s_2^k(x^k, \theta_2), \quad \forall \theta_2 \in \Theta_2, \forall a_2^k \in A_2^k, \\ & (b)\ \sum_{\theta_2 \in \Theta_2} b_1^k(\theta_2|x^k, \theta_1) \sum_{a_2^k \in A_2^k} \sigma_2^k(a_2^k|x^k, \theta_2) \cdot \big[ J_1^k(x^k, a_1^k, a_2^k, \theta_1, \theta_2) + V_1^{k+1}(f^k(x^k, a_1^k, a_2^k), \theta_1) \big] \\ & \qquad \leq -s_1^k(x^k, \theta_1), \quad \forall \theta_1 \in \Theta_1, \forall a_1^k \in A_1^k. \end{aligned}$$
Similarly, $\alpha_1(\theta_1), \alpha_2(\theta_2)$ can be any strictly positive and finite numbers, and $(s_1^k(x^k, \theta_1), s_2^k(x^k, \theta_2))$ is a sequence of scalar variables for each $x^k \in X^k, \theta_i \in \Theta_i, i \in \{1,2\}$. The optimum exists and is achieved at the equality of constraints $(a), (b)$, i.e., $s_i^{*,k}(x^k, \theta_i) = -V_i^k(x^k, \theta_i), \forall \theta_i \in \Theta_i, \forall i \in \{1,2\}$.

The proof is similar to the one for Theorem 1. The decision variables 𝜎𝑖𝑘 are of size |𝐴𝑘𝑖 | × |𝑋 𝑘 | × |Θ𝑖 |. By letting
stage 𝑘 = 𝐾 and 𝑉𝑖𝐾+1 = 0, program 𝐶 𝐾 for the static Bayesian game is a special case of 𝐶 𝑘 for the multi-stage
Bayesian game. We can solve program 𝐶 𝑘+1 to obtain the DBNE strategy pair (𝜎1𝑘+1 , 𝜎2𝑘+1 ) and the value of 𝑉𝑖𝑘+1 .
Then, we apply 𝑉𝑖𝑘+1 in program 𝐶 𝑘 to obtain a DBNE strategy pair (𝜎1𝑘 , 𝜎2𝑘 ) and the value of 𝑉𝑖𝑘 . Thus, for any given
sequences of type belief pairs 𝑏𝑘𝑖 , ∀𝑖 ∈ {1, 2}, ∀𝑘 ∈ {0, 1, ⋯ , 𝐾}, we can solve 𝐶 𝑘 from 𝑘 = 𝐾 to 𝑘 = 0 recursively
to obtain the DBNE pair (𝜎1∗,0∶𝐾−1 , 𝜎2∗,0∶𝐾−1 ).

4.2.1. PBNE
Given a sequence of beliefs, we can obtain the corresponding DBNE via 𝐶 𝑘 in a backward fashion. However,
given a sequence of policies, both players forwardly update their beliefs at each stage by (2). Thus, we need to find a
consistent pair of belief and policy sequences as required by the PBNE. As summarized in Algorithm 1, we iteratively


Algorithm 1: Numerical Solution of 𝜀-PBNE

1  Initialization: beliefs 𝑏𝑘𝑖 at each stage 𝑘 ∈ {0, 1, ⋯ , 𝐾}, ITERNUM > 0, 𝜀 ≥ 0, 𝑡 ∶= 0.
2  while 𝑡 < ITERNUM do
3      𝑡 ∶= 𝑡 + 1;
4      for each 𝑥𝐾 ∈ 𝑋 𝐾 do
5          Compute the SBNE strategy 𝜎𝑖∗,𝐾 and 𝑉𝑖𝐾 (𝑥𝐾 , 𝜃𝑖 ) via 𝐶 𝐾 .
6      end
7      for 𝑘 ← 𝐾 − 1 to 0 do
8          for each 𝑥𝑘 ∈ 𝑋 𝑘 do
9              Compute the DBNE strategy 𝜎𝑖∗,𝑘 and 𝑉𝑖𝑘 (𝑥𝑘 , 𝜃𝑖 ) via 𝐶 𝑘 .
10         end
11     end
12     for 𝑘 ← 0 to 𝐾 − 1 do
13         Update 𝑏𝑘+1𝑖 with 𝜎𝑖∗,0∶𝐾−1 via (2).
14     end
15     if 𝜎𝑖∗,0∶𝐾−1 , ∀𝑖 ∈ {1, 2}, satisfy (5) then
16         Terminate
17     end
18 end
19 Output the 𝜀-PBNE strategy pair (𝜎1∗,0∶𝐾−1 , 𝜎2∗,0∶𝐾−1 ) and consistent beliefs 𝑏𝑘𝑖 , ∀𝑘 ∈ {0, ⋯ , 𝐾}.

alternate between the forward belief update and the backward policy computation to find the PBNE. We resort to
𝜀-PBNE solutions when the existence of PBNE is not guaranteed.
Algorithm 1 provides a computational approach to find 𝜀-PBNE with the following procedure. First, both players
initialize their beliefs 𝑏𝑘𝑖 for every state 𝑥𝑘 at stage 𝑘 ∈ {0, 1, ⋯ , 𝐾}, according to their types. Then, they compute
the DBNE strategy pair 𝜎𝑖∗,0∶𝐾 , ∀𝑖 ∈ {1, 2}, under the given belief sequence at each stage by solving program 𝐶 𝑘
from stage 𝐾 to stage 0 in sequence. Next, they update their beliefs at each stage according to the strategy pair
𝜎𝑖∗,0∶𝐾−1 , ∀𝑖 ∈ {1, 2}, via the Bayesian update (2). If the strategy pair 𝜎𝑖∗,0∶𝐾−1 , ∀𝑖 ∈ {1, 2}, satisfies (5) under the
updated belief, we find the 𝜀-PBNE and terminate the iteration. Otherwise, we repeat the backward policy computation
in step two and the forward belief update in step three.
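A high-level Python skeleton of Algorithm 1 is sketched below. The helpers solve_Ck, bayes_update, and satisfies_eq are hypothetical placeholders for the program $C^k$ of Theorem 2, the update (2), and the check of condition (5), respectively; the paper does not specify them in code form:

```python
def solve_pbne(K, states, beliefs, solve_Ck, bayes_update, satisfies_eq,
               iter_num=100):
    """Skeleton of Algorithm 1: alternate between backward policy
    computation and forward belief update until an epsilon-PBNE is found
    or the iteration budget is exhausted."""
    sigma = {}
    for _ in range(iter_num):
        V = {K + 1: lambda x, theta: 0.0}      # virtual termination value
        for k in range(K, -1, -1):             # backward policy computation
            sigma[k], V[k] = solve_Ck(k, states[k], beliefs[k], V[k + 1])
        for k in range(K):                     # forward belief update via (2)
            beliefs[k + 1] = bayes_update(k, beliefs[k], sigma[k])
        if satisfies_eq(sigma, beliefs):       # condition (5) holds
            break
    return sigma, beliefs
```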

5. Case Study
The model presented in Section 2 can be applied to various APT scenarios. To illustrate the framework, this
section presents a specific attack scenario where the attacker stealthily initiates infection and escalates privileges in the
cyber network, aiming to launch attacks on the physical plant as shown in Fig. 3. Three vertical columns in the left
block illustrate the state transitions across three stages: the initial compromise, the privilege escalation, and the sensor
compromise of a physical system. The red squares at each column represent possible states at that stage. The right
block illustrates a simplified flow chart of the Tennessee Eastman Process. We use the Tennessee Eastman process as
a benchmark of industrial control systems to show that attackers can strategically compromise the SCADA system and
decrease the operational efficiency of a physical plant without triggering the alarm.
In this case study, we adopt the binary type space Θ2 = {𝜃2𝑏 , 𝜃2𝑔 } and Θ1 = {𝜃1𝐻 , 𝜃1𝐿 } for the user and the defender,
respectively. In particular, 𝜃2𝑏 and 𝜃2𝑔 denote the adversarial and legitimate user, respectively; 𝜃1𝐻 and 𝜃1𝐿 denote the
sophisticated and primitive defender, respectively. The bi-matrices in Tables 2, 3, and 4 represent both players' expected
utilities at three stages, respectively. In these matrices, the defender is the row player and the user is the column player.
Each entry of the matrix corresponds to players’ payoffs under their action pairs, types, and the state. In particular, the
two red numbers in the parentheses before the semicolon are the payoffs of the defender and the user, respectively, under type 𝜃2𝑏 , while the parentheses in blue after the semicolon present the payoffs of the defender and the user, respectively, under type 𝜃2𝑔 .


Figure 3: The diagram of the cyber state transition (the left block, in orange) and the physical attack on the Tennessee Eastman process via the compromise of the SCADA system (the right block, in blue). APTs can damage the normal industrial operation by falsifying controllers' setpoints, tampering with sensor readings, and blocking communication channels to cause delays in either the control messages or the sensing data.

5.1. Initial Stage: Phishing Emails


We use a binary set to represent whether the reconnaissance is effectual (𝑥0 = 1) or not (𝑥0 = 0). Effectual recon-
naissance collects essential intelligence that can better support APTs for an initial entry through phishing emails. To
penalize the adversarial exploitation of the open-source intelligence (OSINT) data, the defender can create avatars
(fake personal profiles) on the social network or the company website as shown in Molok, Chang and Ahmad (2010).
At the initial stage of interaction, a user can send emails with non-executable attachments and shortened URLs to
the accounts of entry-level employees, managers, or avatars. These three action options of the user are represented
by 𝑎02 = 0, 1, 2, respectively. Non-executable files such as PDF and MS Office are widely used in organizations
yet an APT attacker can exploit them to execute malicious actions on the victim’s computer. The shortened URL
is created by legitimate service providers such as Google URL shortener yet can redirect to malicious links. The
existing email security mechanisms are not completely effective for identifying malicious PDF files (see Nissim et al.
(2015)) and malicious links behind shortened URLs (see Sahoo, Liu and Hoi (2017)). As a supplement to technical
countermeasures, security training should be emphasized to increase employees’ security awareness and protect them
from web phishing. For example, after receiving suspicious links or attachments with strange names at unexpected
times, the entry-level employee and the manager should be aware of the potential risk and apply extra security measures
such as a digital signature request from the sender before clicking the link or opening the attachment. They should
also be sufficiently alert and report immediately if a PDF does not contain the information that it claims to have. Then
isolation can be applied to prevent the attacker from the potential lateral movement. Since employees’ awareness and
alertness diminish over time, the security training needs to be repeated at reasonable intervals as argued in Mitnick and
Simon (2011), which can be costly. With a limited budget, the defender can choose to educate entry-level employees,
manager-level employees, or no training to avoid the prohibitive training cost 𝑐 0 . These three action options of the
defender are represented by 𝑎01 = 1, 2, 0, respectively. The utility matrix of the initial infection is given in Table 2.
If the user is legitimate, i.e., 𝜃2 = 𝜃2𝑔 , then as denoted in the blue color, he receives an immediate reward 𝑟01 if he
successfully communicates with the employee or the manager by email, but receives a substantial penalty 𝑟0𝑔,𝑓 < 0 if
he emails the avatars because he should not contact a non-existing person. If the user is adversarial, i.e., 𝜃2 = 𝜃2𝑏 , then
as denoted in the red color, he receives an immediate attack reward 𝑟02 if the email receiver does not have proper security
training, but an additional attack cost 𝑟0 if the receiver has been trained properly. The adversarial user receives a faked
reward 𝑟0𝑏,𝑓 > 0 when contacting the avatar, yet arrives at an unfavorable state at stage 𝑘 = 1 and receives few rewards
in the future stages. The training cost and the attack cost are both different for the primitive and the sophisticated


Table 2
The expected utilities of the defender and the user at the initial stage, i.e., 𝐽10 and 𝐽20 , respectively.

𝜃2𝑏 ; 𝜃2𝑔        | Email Employees              | Email Managers               | Email Avatars
No Training     | (−𝑟02 , 𝑟02 ); (0, 𝑟01 )     | (−𝑟02 , 𝑟02 ); (0, 𝑟01 )     | (0, 𝑟0𝑏,𝑓 ); (0, 𝑟0𝑔,𝑓 )
Train Employees | (−𝑐 0 , −𝑟0 ); (−𝑐 0 , 𝑟01 ) | (−𝑐 0 , 𝑟02 ); (−𝑐 0 , 𝑟01 ) | (−𝑐 0 , 𝑟0𝑏,𝑓 ); (−𝑐 0 , 𝑟0𝑔,𝑓 )
Train Managers  | (−𝑐 0 , 𝑟02 ); (−𝑐 0 , 𝑟01 ) | (−𝑐 0 , −𝑟0 ); (−𝑐 0 , 𝑟01 ) | (−𝑐 0 , 𝑟0𝑏,𝑓 ); (−𝑐 0 , 𝑟0𝑔,𝑓 )

Table 3
The expected utilities of the defender and the user at the intermediate stage, i.e., 𝐽11 and 𝐽21, respectively.

| 𝜃2𝑏 ; 𝜃2𝑔            | NOP             | Escalate Privilege        |
| Permit Escalation   | (0, 0) ; (0, 0) | (−𝑟12, 𝑟12) ; (𝑟11, 𝑟11)  |
| Restrict Escalation | (0, 0) ; (0, 0) | (𝑟1, −𝑟1) ; (−𝑟11, −𝑟11)  |

The training cost and the attack cost both differ between the primitive and the sophisticated defender, i.e., $c^0 := c_L^0 \cdot \mathbf{1}_{\{\theta_1 = \theta_1^L\}} + c_H^0 \cdot \mathbf{1}_{\{\theta_1 = \theta_1^H\}}$ and $r^0 := r_L^0 \cdot \mathbf{1}_{\{\theta_1 = \theta_1^L\}} + r_H^0 \cdot \mathbf{1}_{\{\theta_1 = \theta_1^H\}}$. The sophisticated defender holds the security training with a higher frequency, which incurs a higher cost, i.e., $c_H^0 > c_L^0$, but is also more effective in mitigating web phishing, i.e., $r_H^0 > r_L^0$.
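To make the construction of Table 2 concrete, the sketch below assembles the stage-0 payoff matrices for each pair of types. It is a minimal illustration with hypothetical parameter values; neither the numbers nor the Python encoding comes from the case study.

```python
import numpy as np

# Hypothetical stage-0 parameters (illustrative values only).
r1_0, r2_0 = 10.0, 20.0    # legitimate reward r_1^0 and attack reward r_2^0
r_bf, r_gf = 5.0, -15.0    # fake reward r_0^{b,f} > 0 and penalty r_0^{g,f} < 0
c0 = {'L': 2.0, 'H': 4.0}  # training costs c_L^0 < c_H^0
r0 = {'L': 3.0, 'H': 6.0}  # attack costs r_L^0 < r_H^0

def stage0_utilities(theta1, theta2):
    """Payoff matrices (J1, J2) of Table 2 for defender type theta1 in
    {'L', 'H'} and user type theta2 in {'b', 'g'}.
    Rows (a1): no training, train employees, train managers.
    Columns (a2): email employees, email managers, email avatars."""
    c, r = c0[theta1], r0[theta1]
    if theta2 == 'b':  # adversarial user (red entries of Table 2)
        J1 = np.array([[-r2_0, -r2_0, 0.0],
                       [-c, -c, -c],
                       [-c, -c, -c]])
        J2 = np.array([[r2_0, r2_0, r_bf],
                       [-r, r2_0, r_bf],
                       [r2_0, -r, r_bf]])
    else:              # legitimate user (blue entries of Table 2)
        J1 = np.array([[0.0, 0.0, 0.0],
                       [-c, -c, -c],
                       [-c, -c, -c]])
        J2 = np.array([[r1_0, r1_0, r_gf]] * 3)
    return J1, J2
```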

5.2. Intermediate Stage: Privilege Escalation


The state at the intermediate stage can be interpreted as the location of the user where 𝑥1 = 1 refers to the employee’s
computer, 𝑥1 = 2 refers to the manager’s computer, and 𝑥1 = 0 refers to the quarantine area. After the initial access,
the user operates within a process of low privilege. To access certain resources, the user needs to gain higher-level
privileges. An attacker can utilize the process injection to execute malicious code in the address space of a live process
and masquerade as legitimate programs to evade detection as shown in Team (2017). A mitigation method for the
defender is to prevent certain endpoint behaviors that can occur during the process injection. Table 3 presents this
game of privilege escalation.
The user can choose to escalate his privileges, or choose ‘no operation performed (NOP)’. The two action options
are denoted by 𝑎12 = 1 and 𝑎12 = 0, respectively. The defender can choose to either restrict or permit an escalation,
which are denoted by 𝑎11 = 1 and 𝑎11 = 0, respectively. If the legitimate user escalates his privilege and the defender
permits escalation, then both players obtain a reward of 𝑟11 . If the legitimate user escalates his privilege and the defender
restricts escalation, then the efficiency reduction brings a loss of 𝑟11 to both players. On the other hand, if the adversarial
user escalates his privilege and the defender permits escalation, the defender receives a loss of 𝑟12 . If the adversarial
user escalates his privilege and the defender restricts escalation, then the adversarial user has to resort to other attack
techniques, which lead to a higher rate of detection. Thus, the defender obtains a reward while the attacker incurs an additional cost. We assume that the reward and the additional cost are both 𝑟1𝐿 if the defender is primitive and 𝑟1𝐻 if the defender is sophisticated, i.e., $r^1 = r_L^1 \cdot \mathbf{1}_{\{\theta_1 = \theta_1^L\}} + r_H^1 \cdot \mathbf{1}_{\{\theta_1 = \theta_1^H\}}$.
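To illustrate how these type-dependent rewards shape the defender’s stage-1 decision, the sketch below computes the belief threshold at which restricting escalation becomes preferable. It assumes, for simplicity, that both user types escalate (the rightmost column of Table 3), and all numbers are hypothetical.

```python
# Hypothetical stage-1 parameters (illustrative values only).
r1_1, r2_1, r_1 = 8.0, 25.0, 12.0   # r_1^1, r_2^1, and the type-dependent r^1

def stage1_defender_utilities(b):
    """Defender's expected utilities of permitting vs. restricting escalation,
    given belief b that the user is adversarial and both types escalating."""
    permit = b * (-r2_1) + (1 - b) * r1_1
    restrict = b * r_1 + (1 - b) * (-r1_1)
    return permit, restrict

# Belief at which the defender is indifferent (permit == restrict):
b_star = 2 * r1_1 / (2 * r1_1 + r_1 + r2_1)   # about 0.30 with these numbers
# For beliefs above b_star, restricting escalation is the better response.
```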

5.3. Final Stage: Sensor Compromise


The state at the final stage represents four possible privilege levels, denoted by 𝑥2 ∈ {0, 1, 2, 3}. The
privilege level affects the result of the physical attack at the final stage. The defender’s and the user’s actions, and the
state at the intermediate stage determine the state at the final stage. For example, if the user is at the quarantine area
during the intermediate stage, then he ends up with a level-zero privilege regardless of actions taken by the defender
and himself. Users who take control of the manager’s computer at the intermediate stage can obtain a higher privilege
level than those who start from the entry-level employee’s computer, yet the degree of escalation is reduced if the


defender chooses to restrict escalation.
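One plausible way to read this transition is the deterministic mapping sketched below. The exact mapping used in the experiments is not spelled out in the text, so the NOP cases in particular are our assumption.

```python
def final_state(x1, a1_restrict, a2_escalate):
    """A plausible transition to the final privilege level x2 in {0, 1, 2, 3}.
    x1: 0 = quarantine, 1 = employee's computer, 2 = manager's computer.
    a1_restrict: True if the defender restricts escalation.
    a2_escalate: True if the user escalates privilege."""
    if x1 == 0:
        return 0                    # quarantined users keep level-zero privilege
    if not a2_escalate:
        return 1 if x1 == 1 else 2  # assumed NOP levels (not stated in the text)
    top = 2 if x1 == 1 else 3       # escalation from the manager reaches higher
    return top - 1 if a1_restrict else top  # restriction reduces the escalation
```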


We modify the Simulink model in Bathelt, Ricker and Jelali (2015) to quantify the monetary loss of the Tennessee
Eastman process under sensor compromises. Our attack model of sensor compromise is presented in Section 5.3.2. A
new performance metric to quantify the operational efficiency of the Tennessee Eastman process is proposed in Section
5.3.1 and applied in the game matrix in Section 5.3.3.

5.3.1. Performance Metric


The Tennessee Eastman process involves two irreversible reactions to produce two liquid (liq) products 𝐺, 𝐻 from
four gaseous (g) reactants 𝐴, 𝐶, 𝐷, 𝐸 as shown in the right block of Fig. 3. The control objective is to maintain a desired
production rate as well as product quality while stabilizing the whole system under Gaussian noise, avoiding violations of safety constraints such as a high reactor pressure, a high reactor temperature, and a high/low separator/stripper liquid level.
Previous studies on the security of the Tennessee Eastman process have mostly focused on how an attacker can shut down the plant in the shortest time (see Krotofil and Cárdenas (2013)) or cause a serious violation of a setpoint, e.g., the reactor pressure exceeding 3,000 kPa (see Cárdenas, Amin, Lin, Huang, Huang and Sastry (2011)). These attacks successfully cause the shutdown of the plant, and a few days of shutdown can incur a considerable financial loss. However, the shutdown also discloses the attack and leads to an immediate patch and a defense strategy update. Thus, it becomes harder for the same kind of attack to succeed after the plant recovers from the shutdown.
In our APT scenario, the attacker aims to stealthily decrease the operational efficiency of the plant, i.e., to deviate the plant from its normal operating state without triggering the safety alarm or shutting down the plant. By compromising the SCADA system and generating fraudulent sensor readings, the attacker can stealthily make the plant operate at a non-optimal state with reduced utilities. The following economic metrics affect the operational utility of the Tennessee Eastman process:
• Hourly operating cost 𝐶𝑜 with the unit ($∕ℎ) is taken as the sum of purge costs, product stream costs, compressor
costs, and stripper steam costs.
• Production rate 𝑅𝑝 with the unit (𝑚3 ∕ℎ) is the volume of total products per hour.
• Quality of products 𝑄𝑝, with the unit (G mole%), is the percentage of 𝐺 among total products.

• 𝑃𝐺 with the unit ($∕𝑚3 ) is the price of product 𝐺.


We propose a new performance metric 𝑈𝑇𝐸, the per-hour utility, to quantify the operational efficiency of the Tennessee Eastman process as follows:

𝑈𝑇𝐸 = 𝑅𝑝 × 𝑄𝑝 × 𝑃𝐺 − 𝐶𝑜. (10)
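As a quick numerical illustration of Eq. (10), the snippet below evaluates the metric; all numbers are placeholders rather than outputs of the Simulink model.

```python
# Placeholder values, not taken from the Tennessee Eastman simulation.
R_p = 22.9    # production rate, m^3/h
Q_p = 0.539   # quality: fraction of product G among total products
P_G = 2.5e4   # price of product G, $/m^3
C_o = 170.0   # hourly operating cost, $/h

U_TE = R_p * Q_p * P_G - C_o      # per-hour utility, $/h
print(f"U_TE = {U_TE:,.0f} $/h")  # roughly 3.1e5 $/h with these placeholders
```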

5.3.2. Attack Model


An attack model is characterized by two parts: information and capacity. First, the information available
to the attacker such as readings of different sensors can affect the performance of the attack differently. For example,
observing the input rate of the raw material in the Tennessee Eastman process is less beneficial for the attacker than
the direct measurements of 𝑃𝐺 , 𝑅𝑝 , 𝑄𝑝 , 𝐶𝑜 that affect the utility metric in (10). Second, attackers can have different
capacities in accessing and revising controllers and sensors. An attacker may change the parameters of the proportional-
integral-derivative controller, directly falsify the controller output, or indirectly deviate the setpoint by tampering,
blocking or delaying sensor readings.
In this experiment, we assume reading manipulations of sensors XMEAS(40) and XMEAS(17) in loops 8 and 13 of the Tennessee Eastman process (see Ricker (1996)), respectively. Sensor XMEAS(40) measures the composition
of component 𝐺 and sensor XMEAS(17) measures the stripper underflow. A higher privilege state 𝑥2 ∈ {0, 1, 2, 3}
means that the user can access more sensors for a longer time, which results in a larger loss and thus a smaller utility
of 𝑟21 (𝑥2 ) to the defender if the user is adversarial. Fig. 4 shows the variation of 𝑈𝑇 𝐸 versus the simulation time under
four different privilege states. We use the time average of these utilities to obtain the normal operational utility 𝑟24
and compromised utilities 𝑟21 (𝑥2 ) under four different privilege states 𝑥2 ∈ {0, 1, 2, 3}. The attacker compromises the
sensor and generates fraudulent readings. The fraudulent reading can be a constant, denoted by the blue line, or a
double of the real readings, denoted by the red or green lines. The pink line represents a composition attack with a
limited control time. Initially, the attacker manages to compromise both sensors by doubling their readings. After the


[Figure 4 plots here: two panels, ‘Sensor Compromise in Loop 13’ (top) and ‘Sensor Compromise in Loop 8 and 13’ (bottom), each showing utility ($, ×10^5) versus time (hrs) under normal operation and under constant-reading, twofold-reading, and composition attacks; annotations mark when attackers lose access to sensors XMEAS(40) and XMEAS(17).]

Figure 4: The economic impact of sensor compromise in the Tennessee Eastman process. The black line represents the utility of the Tennessee Eastman process under normal operation, while the other four lines represent its utility under attacks with the four possible privilege levels. We use the time average of these utilities to obtain the normal operational utility 𝑟24 and the compromised utilities 𝑟21(𝑥2), ∀𝑥2 ∈ {0, 1, 2, 3}, under the four privilege states in Table 4.

attacker loses access to XMEAS(40) at the 6th hour, the system is sufficiently resilient to recover partially in about 16 hours and achieve the same level of utility as the single attack shown in green. When the attacker also loses access to XMEAS(17) at the 36th hour, the utility returns to normal in about 13 hours.
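A sketch of the reading manipulations described above is given below; the interface and constants are illustrative assumptions, not the experiment’s code.

```python
def spoofed_reading(true_value, t, attack, access_until=float('inf')):
    """Fraudulent sensor reading at time t (hours).
    attack: 'constant' pins the output to a fixed value, 'twofold' doubles it."""
    if t > access_until:   # the attacker has lost access: truth is reported
        return true_value
    if attack == 'constant':
        return 50.0        # hypothetical frozen reading
    if attack == 'twofold':
        return 2.0 * true_value
    return true_value

# Composition attack with a limited control time, as in the pink line of Fig. 4:
# XMEAS(40) doubled until hour 6, XMEAS(17) doubled until hour 36.
xmeas40 = lambda v, t: spoofed_reading(v, t, 'twofold', access_until=6.0)
xmeas17 = lambda v, t: spoofed_reading(v, t, 'twofold', access_until=36.0)
```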

5.3.3. Utility Matrix


Attackers against a SCADA system can apply command injection attacks to inject false control commands and compromise sensor readings, as shown in Morris and Gao (2013). Encryption can be introduced to conceal these malicious commands. However, a legitimate user may also encrypt his communication with the sensor to avoid eavesdropping and enhance privacy.
Therefore, at the final stage, the user has two options: sending commands to the sensor with or without encryption, denoted by 𝑎22 = 1 and 𝑎22 = 0, respectively. The defender chooses to apply either complete or selective monitoring, denoted by 𝑎21 = 1 and 𝑎21 = 0, respectively. The complete monitoring stores all sets of communication


Table 4
The expected utilities of the defender and the user at the final stage, i.e., 𝐽12 and 𝐽22, respectively.

| 𝜃2𝑏 ; 𝜃2𝑔                  | Unencrypted Command (UC)           | Encrypted Command (EC)                  |
| Selective Monitoring (SM) | (𝑟24, 0) ; (𝑟24, 𝑟24∕2)            | (𝑟21(𝑥2), 𝑟24 − 𝑟21(𝑥2)) ; (𝑟24, 𝑟24)   |
| Complete Monitoring (CM)  | (𝑟24 − 𝑐2, 0) ; (𝑟24 − 𝑐2, 𝑟24∕2)  | (𝑟2 − 𝑐2, −𝑟2) ; (𝑟24 − 𝑐2, 𝑟24)        |

data and analyzes it thoroughly to identify malicious commands despite encryption. The selective monitoring cannot identify malicious commands if they are encrypted. The implementation of the complete monitoring incurs an additional cost 𝑐2 compared to the selective one. The last-stage utility matrix of both players is defined in Table 4. If the user is legitimate, as denoted in blue, both the defender and the user receive a reward of 𝑟24 when the Tennessee Eastman process operates normally. Legitimate users further receive a utility reduction of 𝑟24∕2 for the potential privacy loss if they choose unencrypted commands. Adversarial users send malicious commands only when the communication is encrypted, so as to evade detection. Thus, if they choose not to encrypt the communication, they receive zero utility and the defender receives a reward of 𝑟24 for the normal operation. However, if they choose to send encrypted malicious commands, both players’ rewards depend on whether the defender chooses the selective or the complete monitoring. If the defender chooses the selective monitoring, then the adversarial user can successfully compromise the sensor, which results in a reduced utility of 𝑟21(𝑥2) for the defender. In the meantime, the attacker benefits from the reward reduction 𝑟24 − 𝑟21(𝑥2). If the defender chooses the complete monitoring, then the adversarial user suffers a loss of 𝑟2 for being detected. The detection reward and the implementation cost for the two types of defenders are 𝑟2𝐿, 𝑟2𝐻 and 𝑐2𝐿, 𝑐2𝐻, respectively. Let $r^2 := r_L^2 \cdot \mathbf{1}_{\{\theta_1 = \theta_1^L\}} + r_H^2 \cdot \mathbf{1}_{\{\theta_1 = \theta_1^H\}}$ and $c^2 := c_L^2 \cdot \mathbf{1}_{\{\theta_1 = \theta_1^L\}} + c_H^2 \cdot \mathbf{1}_{\{\theta_1 = \theta_1^H\}}$.
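To connect Table 4 with the equilibrium analysis in Section 6, the sketch below evaluates the defender’s best response under a belief about the user’s type; the parameter values are hypothetical, while the payoff entries follow Table 4.

```python
import numpy as np

# Hypothetical final-stage parameters (illustrative values only).
r4, r2, c2 = 3.3e5, 1.0e5, 2.0e4  # r_4^2, detection reward r^2, monitoring cost c^2
r1_x2 = 2.5e5                     # compromised utility r_1^2(x^2) at some state x^2

# Defender's payoffs from Table 4; rows: SM, CM; columns: UC, EC.
J1_bad = np.array([[r4, r1_x2],
                   [r4 - c2, r2 - c2]])   # against the adversarial type
J1_good = np.array([[r4, r4],
                    [r4 - c2, r4 - c2]])  # against the legitimate type

def defender_best_response(b, q_bad, q_good):
    """Return 0 (SM) or 1 (CM), maximizing the defender's expected utility given
    belief b that the user is adversarial and each type's probability q of UC."""
    eu = []
    for row in (0, 1):
        eu_bad = q_bad * J1_bad[row, 0] + (1 - q_bad) * J1_bad[row, 1]
        eu_good = q_good * J1_good[row, 0] + (1 - q_good) * J1_good[row, 1]
        eu.append(b * eu_bad + (1 - b) * eu_good)
    return int(np.argmax(eu))
```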

6. Computation Results
In this section, we apply the algorithms introduced in Section 4 to compute both players’ strategies and utilities at
the equilibrium. We implement our algorithms in MATLAB and use YALMIP (see Löfberg (2004)) as the interface
to call external solvers such as BARON (see Tawarmalani and Sahinidis (2005)) to solve the optimization problems.
We present detailed results from the concrete case study and provide meaningful insights into the proactive cross-layer defense against multi-stage APT attacks that are stealthy and deceptive.
For the static Bayesian game at the final stage in Section 6.1, we focus on illustrating how the two players’ private types affect their policies and utilities under different information structures. We further apply a sensitivity analysis to show how the value of a key parameter affects the defender’s and the attacker’s utilities. For the multi-stage Bayesian game in Section 6.2, we focus on the dynamics of the belief update and the state transition under the interaction between the stealthy attacker and the proactive defender. Moreover, we investigate how adversarial and defensive deception, as well as the initial state, can affect the stage utility and the cumulative utility of the user and the defender.

6.1. Final Stage and SBNE


Players’ beliefs affect their policies and expected utilities at the final stage. We discuss three different scenarios as
follows. In Fig. 5a, the defender does not know the user’s type. In Fig. 6, the user does not know the defender’s type. In Fig. 5b, neither the user nor the defender knows the other’s type. In all three scenarios, the 𝑥-axis represents the belief of either the user or the defender. The 𝑦-axis of the upper figure represents the probability of either the defender taking action ‘selective monitoring (SM)’ or the user taking action ‘unencrypted command (UC)’. Fig. 5a shows the following trends as the user becomes more likely to be adversarial. First, the two black lines show that the expected utility of the defender decreases and the defender is more inclined to apply action ‘complete monitoring’ after her belief exceeds a threshold. Second, the two red lines show that the adversarial user takes action ‘unencrypted command’ with a higher probability and only gains a reward when the probability of adversarial users is sufficiently small. Thus, we conclude that when the probability of the adversarial user increases, the defender tends to invest more in cyber defense so that the attacker behaves more conservatively and inflicts fewer losses. Third, the two blue lines show that the legitimate user always chooses ‘encrypted command’ and receives a constant utility, which indicates that the proactive defense does not affect the behavior and the utility of legitimate users at this stage.


[Figure 5 plots here: panels (a) and (b); in each, the upper plot shows the probability of UC/SM and the lower plot shows the expected utility (×10^5) versus the probability of the adversarial user, with curves for the defender, the legitimate user, and the adversarial user.]

(a) The user knows that the defender is primitive, yet the defender only knows the probability of the user being adversarial.
(b) Both players’ types are private, and each player only knows the probability of the other player’s type.

Figure 5: The SBNE strategy and the expected utility of the primitive defender and the user who is either legitimate or adversarial. The 𝑥-axis represents the probability of the user being adversarial. The 𝑦-axis of the upper figure represents the probability of either the defender taking action ‘selective monitoring (SM)’ or the user taking action ‘unencrypted command (UC)’.

[Figure 6 plot here: the upper plot shows the probability of UC/SM and the lower plot shows the expected utility (×10^5) versus the probability of the sophisticated defender, with curves for the attacker, the primitive defender, and the sophisticated defender.]
Figure 6: The SBNE strategy and the expected utility of the adversarial user and the defender who is either primitive or sophisticated. The defender knows that the user is adversarial, while the adversarial user only knows the probability of the defender being primitive. The 𝑥-axis represents the probability of the defender being sophisticated. The 𝑦-axis of the upper figure represents the probability of either the defender taking action ‘selective monitoring (SM)’ or the user taking action ‘unencrypted command (UC)’.

Fig. 6 shows that the defender benefits from introducing defensive deception. When the defender becomes more likely to be a sophisticated one, both types of defenders can apply the selective monitoring with a higher probability and save the extra surveillance cost of the complete monitoring. The attacker with incomplete information has a threshold policy and switches to a lower attacking probability after his belief reaches the threshold of 0.5, as shown by the black line. When the probability goes beyond the threshold, the primitive defender can pretend to be a sophisticated one and take action ‘selective monitoring’. Meanwhile, a sophisticated defender can reduce the security effort and take action


[Figure 7 plots here: two panels showing the primitive defender’s utility and the attacker’s utility (×10^5) versus 𝑟2𝐿 (×10^5) under the four states 𝑥2 ∈ {0, 1, 2, 3}.]

Figure 7: Utilities of the primitive defender and the attacker versus the value of 𝑟2𝐿 under different states 𝑥2 ∈ {0, 1, 2, 3}.

‘selective monitoring’ with a higher probability, since the attacker becomes more cautious in taking adversarial actions after identifying the defender as more likely to be sophisticated. It is also observed that the sophisticated defender receives a higher payoff before the attacker’s belief reaches the 0.5 threshold. After the belief reaches the threshold, the attacker is deterred into taking less aggressive actions, and both types of defenders share the same payoff.
Finally, we consider the double-sided incomplete information where both players’ types are private information and each player only holds a belief about the other player’s type. Compared with the defender in Fig. 5a, who takes action ‘selective monitoring’ with a probability less than 0.5 and receives a decreasing expected payoff, the defender in Fig. 5b can take ‘selective monitoring’ with a probability close to 1 and receive a constant expected payoff after the user’s belief exceeds the threshold. Thus, the defender can spare defense efforts and mitigate risks by introducing uncertainties about her type as a countermeasure to the adversarial deception.

6.1.1. Sensitivity Analysis


As shown in Fig. 7, if the value of the penalty 𝑟2𝐿 is close to 0, i.e., the defense at the final stage is ineffective, then an arrival at state 𝑥2 = 3, the highest privilege level, can significantly increase the attacker’s payoff and cause the most damage to the defender. As more effective defensive methods are employed at the final stage, i.e., as the value of 𝑟2𝐿 increases, the attacker becomes more conservative and strategic in taking adversarial actions. Then, the state with the highest privilege level may no longer be the most favorable state for the attacker.

6.2. Multi-stage and PBNE


We show in Fig. 8 that the Bayesian belief update leads to a more accurate estimate of the user’s type. Without the belief update, the posterior belief equals the prior belief, shown in red, which serves as the baseline. As the prior belief increases along the 𝑥-axis, the posterior belief after the Bayesian update, shown in blue, also increases. The blue line lies in general above the red line, which means that with the Bayesian update, the defender’s belief moves closer to the true type. Also, we find that the belief update is most effective when an inaccurate prior belief is used, as it corrects the erroneous belief significantly.
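A minimal sketch of this forward Bayesian update is given below; the dictionary interface is our assumption, with policy_bad and policy_good mapping the observed action to the probability that each type plays it.

```python
def bayes_update(prior_bad, action, policy_bad, policy_good):
    """Posterior belief that the user is adversarial after observing `action`."""
    num = prior_bad * policy_bad[action]
    den = num + (1 - prior_bad) * policy_good[action]
    # Off the equilibrium path (den == 0) the belief is unconstrained;
    # a common convention is to retain the prior.
    return prior_bad if den == 0 else num / den

# An action twice as likely under the adversarial type sharpens the belief:
posterior = bayes_update(0.3, 'escalate',
                         policy_bad={'escalate': 0.8, 'NOP': 0.2},
                         policy_good={'escalate': 0.4, 'NOP': 0.6})
# posterior = 0.24 / (0.24 + 0.28), approximately 0.46 > 0.3
```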
In Fig. 9, we show that the proactive defense, i.e., the defensive methods at the intermediate stages, can affect the state transition and reduce the probability of attackers reaching states that can result in huge damage at the final stage. As the prior belief of the user being adversarial increases, the attacker is more likely to arrive at states 𝑥2 = 0 and 𝑥2 = 1, and the probability of visiting 𝑥2 = 2 and 𝑥2 = 3 decreases.

6.2.1. Adversarial and Defensive Deception


Fig. 10 investigates the adversarial deception where the attacker takes full control of the defense system and manipulates the defender’s belief. As shown in the figure, the defender’s utilities all increase as the belief under deception approaches the correct belief that the user is adversarial. Moreover, the increase is stair-wise, i.e., the defender only alters her policy when the manipulated belief crosses certain thresholds. Under the same manipulated belief, a sophisticated defender benefits no less than a primitive one. The defender receives a lower payoff when the


[Figure 8 plots here: four panels showing the posterior belief of the adversarial user versus the prior belief, with the Bayesian belief update and without.]

Figure 8: The defender’s prior and posterior beliefs of the user being adversarial.

[Figure 9 plots here: four panels showing the state probability versus the prior belief of the adversarial user, with the Bayesian belief update and without.]

Figure 9: The probability of different states 𝑥2 ∈ {0, 1, 2, 3}.

reconnaissance provides effectual intelligence.


Incapable of revealing the adversarial deception completely, the defender can alternatively introduce defensive deception, e.g., a primitive defender can disguise herself as a sophisticated one to confuse the attacker. Defensive deception introduces uncertainties to attackers, increases their costs, and increases the defender’s utility. Fig. 11 investigates the defender’s and the attacker’s utilities under three different scenarios. The complete information refers to the scenario where both players know the other player’s type. The deception with the 𝐻-type or the 𝐿-type means that the attacker knows the defender’s type to be sophisticated or primitive, respectively, yet the defender has no information about the user’s type. The double-sided deception indicates that neither player knows the other player’s type.


[Figure 10 plots here: four panels showing the system utility under deception (×10^5) versus the belief of the adversarial user under deception.]

Figure 10: The defender’s utility under deceived beliefs.

The results from Fig. 11 are summarized as follows. First, the sophisticated defender’s payoffs can be as much as 56% higher than those of the primitive defender. Also, preventing effectual reconnaissance increases the defender’s utility by as much as 41% and reduces the attacker’s utility by as much as 38%. Second, the defender and the attacker receive the highest and the lowest payoff, respectively, under the complete information. When the attacker introduces deception over his type, the attacker’s utility increases and the defender’s utility decreases. Third, when the defender adopts defensive deception to introduce double-sided incomplete information, the decrease in the sophisticated defender’s utility is reduced by at most 64%, i.e., it changes from $55,570 to $35,570 when the reconnaissance is effectual. The double-sided incomplete information also brings the attacker lower utilities than the one-sided adversarial deception. However, the defender’s utility under the double-sided deception is still less than in the complete-information case, which shows that acquiring complete information about the adversarial user is the most effective defense. If complete information cannot be obtained, however, the defender can mitigate her loss by introducing defensive deception.

7. Discussions and Conclusions


Advanced Persistent Threats (APTs) are emerging security challenges for cyber-physical systems as the attacker
can stealthily enter, persistently stay in, and strategically interact with the system. In this work, we have developed a
game-theoretic framework to design proactive and cross-layer defenses for cyber-physical systems in a holistic manner.
Dynamic games of incomplete information have been used to capture the long-term interaction between users and
defenders who have private information unknown to the other player. Each player forms a belief on the unknowns and
uses the Bayesian update to learn the private information and reduce uncertainty. The analysis of the Perfect Bayesian
Nash Equilibrium (PBNE) has provided the defender with an effective countermeasure against the stealthy strategic
attacks at multiple stages. To compute the PBNE of the dynamic games, we have proposed a nested algorithm that
iteratively alternates between the forward belief update and the backward policy computation. The algorithm has been
shown to quickly converge to the 𝜀-PBNE that yields a consistent pair of beliefs and policies.
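Schematically, the nested iteration can be organized as below. The forward and backward passes are supplied as callables because the paper’s belief-update and dynamic-programming steps are not reproduced here; this is a control-flow sketch, not the implementation.

```python
def compute_pbne(prior, init_policies, forward_beliefs, backward_policies,
                 policy_distance, eps=1e-3, max_iter=100):
    """Alternate a forward pass of Bayesian belief updates with a backward pass
    of best-response policy computation until the policies move less than eps,
    returning a consistent (beliefs, policies) pair, i.e., an eps-PBNE."""
    policies = init_policies
    beliefs = forward_beliefs(prior, policies)     # stage 0 -> K
    for _ in range(max_iter):
        new_policies = backward_policies(beliefs)  # stage K -> 0
        if policy_distance(new_policies, policies) < eps:
            return beliefs, new_policies
        policies = new_policies
        beliefs = forward_beliefs(prior, policies)
    return beliefs, policies
```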
Using the Tennessee Eastman process as a case study of industrial control systems, we have shown that the proactive
multi-stage defense in cyber networks can successfully mitigate the risk of physical attacks without reducing the payoffs
of legitimate users. In particular, experimental results show that a sophisticated defender receives a payoff up to 56% higher than a primitive defender does. Also, it has been illustrated that by preventing effectual reconnaissance, the defender increases her utility by up to 41% and reduces the attacker’s utility by up to 38%. On the one hand, the


[Figure 11 bar charts here: the defender’s utility ($) and the attacker’s utility ($) under ineffectual and effectual reconnaissance, for the complete information, the deception, and the double deception scenarios, each with the 𝐻-type and the 𝐿-type defender.]

Figure 11: The cumulative utilities of the attacker and the defender under the complete information, the adversarial deception, and the defensive deception. In each legend group, the left three entries represent the utilities for a sophisticated defender and the right three represent those for a primitive defender.

attacker receives a higher payoff after introducing the adversarial deception, as it increases the defender’s uncertainty about the user’s type. On the other hand, by creating uncertainties for attackers, the defender can successfully drive them toward more conservative behaviors and make them less motivated to launch attacks. It has been shown that the defender can significantly mitigate attack losses when she adopts defensive deception.
The main challenge of our approach is to identify the utilities and feasible actions of defenders and users at each stage. One future direction to reduce the complexity of the model description is to develop mechanisms that automate the synthesis of verifiably correct game-theoretic models, which would alleviate the workload of the system defender and operator. Nevertheless, our approach has several merits. First, compared to rule-based and machine-learning-based defense methods, game theory provides a quantitative and explainable framework to design the proactive defensive response under uncertainty: rule-based defense is static, so an attacker can circumvent it with sufficient effort, while machine-learning methods require large labeled data sets that may be hard to obtain in the APT scenario. Second, we have proposed the belief to quantify the uncertainty that results from players’ private types. The belief is continuously updated to reduce uncertainties and provides a probabilistic detection system as a byproduct of the APT response design. Third, our approach enables the defender to evaluate the multi-stage impact of her defense strategies on both legitimate and adversarial users when adversarial and defensive deceptions are present at the same time. Based on the evaluation, defenders can further revise countermeasures and design new game rules to achieve a better tradeoff between security and usability. Our model can be broadly applied to scenarios in artificial intelligence, economics, and social science where multi-stage interactions occur between multiple agents with incomplete information. Multi-sided non-binary types can be defined based on the scenario, and our iterative algorithm of the forward belief update and the backward policy computation can be extended for efficient computation of the perfect Bayesian Nash equilibrium. The future


work would extend the framework to an 𝑁-person game to characterize the simultaneous interactions among multiple
users and model composition attacks. We would also consider scenarios where players’ actions and the system state
are partially observable.

References
Bathelt, A., Ricker, N.L., Jelali, M., 2015. Revision of the Tennessee Eastman process model. IFAC-PapersOnLine 48, 309–314. doi:10.1016/j.ifacol.2015.08.199. 9th IFAC Symposium on Advanced Control of Chemical Processes ADCHEM 2015.
Cárdenas, A.A., Amin, S., Lin, Z.S., Huang, Y.L., Huang, C.Y., Sastry, S., 2011. Attacks against process control systems: risk assessment, detection,
and response, in: Proceedings of the 6th ACM symposium on information, computer and communications security, ACM. pp. 355–366.
The MITRE Corporation, 2019. Enterprise matrix. URL: https://attack.mitre.org/matrices/enterprise/.
Dufresne, M., 2018. Putting the MITRE ATT&CK evaluation into context. URL: https://www.endgame.com/blog/technical-blog/putting-mitre-attck-evaluation-context.
Feng, X., Zheng, Z., Hu, P., Cansever, D., Mohapatra, P., 2015. Stealthy attacks meets insider threats: a three-player game model, in: MILCOM
2015-2015 IEEE Military Communications Conference, IEEE. pp. 25–30.
FireEye, 2017. Advanced Persistent Threat Groups | FireEye. URL: https://www.fireeye.com/current-threats/apt-groups.html.
Friedberg, I., Skopik, F., Settanni, G., Fiedler, R., 2015. Combating advanced persistent threats: From network event correlation to incident
detection. Computers & Security 48, 35–57.
Ghafir, I., Hammoudeh, M., Prenosil, V., Han, L., Hegarty, R., Rabie, K., Aparicio-Navarro, F.J., 2018. Detection of advanced persistent threat
using machine-learning correlation analysis. Future Generation Computer Systems 89, 349–359.
Ghafir, I., Kyriakopoulos, K.G., Lambotharan, S., Aparicio-Navarro, F.J., AsSadhan, B., BinSalleeh, H., Diab, D.M., 2019. Hidden Markov models and alert correlations for the prediction of advanced persistent threats. IEEE Access 7, 99508–99520.
Ghafir, I., Prenosil, V., Hammoudeh, M., Han, L., Raza, U., 2017. Malicious SSL certificate detection: A step towards advanced persistent threat defence, in: Proceedings of the International Conference on Future Networks and Distributed Systems, ACM. p. 27.
Harsanyi, J.C., 1967. Games with incomplete information played by “Bayesian” players, I–III. Part I. The basic model. Management Science 14, 159–182.
Department of Homeland Security, 2018. NSA/CSS Technical Cyber Threat Framework v2. Technical Report, Cybersecurity Operations, Cybersecurity Products and Sharing Division. URL: https://www.nsa.gov/Portals/70/documents/what-we-do/cybersecurity/professional-resources/ctr-nsa-css-technical-cyber-threat-framework.pdf.
Horák, K., Zhu, Q., Bošanskỳ, B., 2017. Manipulating adversary’s belief: A dynamic game approach to deception by design for proactive network security, in: International Conference on Decision and Game Theory for Security, Springer. pp. 273–294.
Huang, L., Chen, J., Zhu, Q., 2017. A large-scale markov game approach to dynamic protection of interdependent infrastructure networks, in:
International Conference on Decision and Game Theory for Security, Springer. pp. 357–376.
Huang, L., Zhu, Q., 2018. Analysis and computation of adaptive defense strategies against advanced persistent threats for cyber-physical systems,
in: International Conference on Decision and Game Theory for Security, Springer. pp. 205–226.
Huang, L., Zhu, Q., 2019a. Adaptive honeypot engagement through reinforcement learning of semi-markov decision processes. CoRR
abs/1906.12182. URL: http://arxiv.org/abs/1906.12182, arXiv:1906.12182.
Huang, L., Zhu, Q., 2019b. Adaptive strategic cyber defense for advanced persistent threats in critical infrastructure networks. ACM SIGMETRICS
Performance Evaluation Review 46, 52–56.
Hutchins, E.M., Cloppert, M.J., Amin, R.M., 2011. Intelligence-driven computer network defense informed by analysis of adversary campaigns
and intrusion kill chains. Leading Issues in Information Warfare & Security Research 1, 80.
Krotofil, M., Cárdenas, A.A., 2013. Resilience of process control systems to cyber-physical attacks, in: Nordic Conference on Secure IT Systems,
Springer. pp. 166–182.
La, Q.D., Quek, T.Q., Lee, J., Jin, S., Zhu, H., 2016. Deceptive attack and defense game in honeypot-enabled networks for the internet of things.
IEEE Internet of Things Journal 3, 1025–1035.
Li, P., Yang, X., Xiong, Q., Wen, J., Tang, Y.Y., 2018. Defending against the advanced persistent threat: An optimal control approach. Security and
Communication Networks 2018.
Ponemon Institute LLC, 2018. 2018 Cost of Data Breach Study.
Löfberg, J., 2004. YALMIP: A toolbox for modeling and optimization in MATLAB, in: Proceedings of the CACSD Conference, Taipei, Taiwan.
Marchetti, M., Pierazzi, F., Colajanni, M., Guido, A., 2016. Analysis of high volumes of network traffic for advanced persistent threat detection.
Computer Networks 109, 127–141.
Messaoud, B.I., Guennoun, K., Wahbi, M., Sadik, M., 2016. Advanced persistent threat: New analysis driven by life cycle phases and their
challenges, in: 2016 International Conference on Advanced Communication Systems and Information Security (ACOSIS), IEEE. pp. 1–6.
Milajerdi, S.M., Kharrazi, M., 2015. A composite-metric based path selection technique for the Tor anonymity network. Journal of Systems and Software 103, 53–61.
Mitnick, K.D., Simon, W.L., 2011. The art of deception: Controlling the human element of security. John Wiley & Sons.
Molok, N.N.A., Chang, S., Ahmad, A., 2010. Information leakage through online social networking: Opening the doorway for advanced persistence threats.
Morris, T.H., Gao, W., 2013. Industrial control system cyber attacks, in: Proceedings of the 1st International Symposium on ICS & SCADA Cyber
Security Research, pp. 22–29.
Nguyen, T.H., Wang, Y., Sinha, A., Wellman, M.P. Deception in finitely repeated security games.
Nissim, N., Cohen, A., Glezer, C., Elovici, Y., 2015. Detection of malicious pdf files and directions for enhancements: A state-of-the art survey.
Computers & Security 48, 246–266.


Pawlick, J., Chen, J., Zhu, Q., 2018. iSTRICT: An interdependent strategic trust mechanism for the cloud-enabled internet of controlled things. arXiv preprint arXiv:1805.00403.
Pawlick, J., Colbert, E., Zhu, Q., 2017. A game-theoretic taxonomy and survey of defensive deception for cybersecurity and privacy. arXiv preprint arXiv:1712.05441.
Ricker, N.L., 1996. Decentralized control of the Tennessee Eastman challenge process. Journal of Process Control 6, 205–221.
Rowe, N.C., Custy, E.J., Duong, B.T., 2007. Defending cyberspace with fake honeypots.
Sahoo, D., Liu, C., Hoi, S.C., 2017. Malicious URL detection using machine learning: A survey. arXiv preprint arXiv:1701.07179.
Shoham, Y., Leyton-Brown, K., 2008. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press.
Sigholm, J., Bang, M., 2013. Towards offensive cyber counterintelligence: Adopting a target-centric view on advanced persistent threats, in: 2013
European Intelligence and Security Informatics Conference, IEEE. pp. 166–171.
Tawarmalani, M., Sahinidis, N.V., 2005. A polyhedral branch-and-cut approach to global optimization. Mathematical Programming 103, 225–249.
Team, M.D.A.R., 2017. Detecting stealthier cross-process injection techniques with Windows Defender ATP: Process hollowing and atom bombing. URL: https://bit.ly/2nVWDQd.
Van Dijk, M., Juels, A., Oprea, A., Rivest, R.L., 2013. FlipIt: The game of “stealthy takeover”. Journal of Cryptology 26, 655–713.
Yang, L.X., Li, P., Zhang, Y., Yang, X., Xiang, Y., Zhou, W., 2018. Effective repair strategy against advanced persistent threat: A differential game
approach. IEEE Transactions on Information Forensics and Security 14, 1713–1728.
Zhang, M., Zheng, Z., Shroff, N.B., 2015. A game theoretic model for defending against stealthy attacks with limited resources, in: International
Conference on Decision and Game Theory for Security, Springer. pp. 93–112.
Zhu, Q., Rass, S., 2018. On multi-phase and multi-stage game-theoretic modeling of advanced persistent threats. IEEE Access 6, 13958–13971.
