1 - A - Automate - Cybersecurity - Data - Triage - by - Leveraging - Human - Analysts - Cognitive - Process-5

This document discusses automating cybersecurity data triage by leveraging human analysts' cognitive processes. It proposes mining data triage rules from analysts' operation traces and using these rules to construct automated data triage systems. The approach was evaluated in a case study where it constructed a data triage state machine from 29 analysts' traces and achieved good results based on false positive/negative rates.

Uploaded by

Stephen Njeri

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

82 views7 pages

1 - A - Automate - Cybersecurity - Data - Triage - by - Leveraging - Human - Analysts - Cognitive - Process-5

Uploaded by

Stephen Njeri

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

2016 IEEE 2nd International Conference on Big Data Security on Cloud, IEEE International Conference on High Performance

and Smart Computing, IEEE International Conference on Intelligent Data and Security

Automate Cybersecurity Data Triage by Leveraging

Human Analysts’ Cognitive Process
Chen Zhong, John Yen, Peng Liu Robert F. Erbacher
Pennsylvania State University Army Research Laboratory
{czz111, jyen, pliu}@ist.psu.edu [email protected]

Abstract—Security Operation Centers rely on data triage to Human analysts are relied upon heavily to identify the true
identify the true “signals” from a large volume of noisy alerts “signals” from a large volume of noisy IDS alerts and “connect
and “connect the dots” to answer certain higher-level questions the dots” to answer some higher-level questions about the
about the attack activities. This work aims to automatically
generate data triage automatons directly from cybersecurity attack activities, such as “whether the network is under an
analysts’ operation traces. Existing methods for generating data attack?”, “what did the attackers do?”, and “what might be
triage automatons, including Security Information and Event their next steps?”. They need to conduct a series of analysis,
Management systems (SIEMs), require event correlation rules to including data triage, escalation analysis, correlation analysis,
be generated by dedicated manual effort from expert analysts. threat analysis, incident response and forensic analysis [1].
To save analysts’ workloads, we propose to “mine” data triage
rules out of cybersecurity analysts’ operation traces and to use Data triage encompasses examining the details of a variety
these rules to construct data triage automatons. Our approach of data sources (e.g., IDS alerts, firewall logs, OS audit trails,
may make the cost (of data triage automaton generation) orders vulnerability reports, and packet dumps), weeding out the false
of magnitudes smaller. We have designed and implemented positives, grouping the related indicator data entries according
the new system and evaluated it through a human-in-the-loop to the corresponding attack campaigns. Data triage is the most
case study. The case study shows that our system can use the
analysts’ operation traces as input and automatically generate a fundamental stage in cyber operations. It provides a basis for
corresponding state machine for data triage. The operation traces closer inspection in the following analysis to finally generate
were collected in our previous lab experiment. 29 professional confidence-bounded attack incident reports. These incident
cybersecurity analysts were recruited to analyze a set of IDS reports will serve as the primary basis for further decision-
alerts and firewall logs. False positive and false negative rates making regarding how to change current security configuration
were calculated to evaluate the performance of the data triage
state machine by comparing with the ground truth. and act against the attacks. In practice, analysts can leverage
SIEM systems to get part of the data triage completed.
I. I NTRODUCTION Data triage is labor-intensive and mostly performed man-
ually by analysts. It’s the state of the practice but restricts
Due to the cannot-be-underestimated risk caused by cyber-
the efficiency of the analysts generating high-quality incident
attacks, nowadays the stake of protecting a mission-critical
reports within a limited time. Compared to a computer, human
or business-critical enterprise network is already so high
brains have orders of magnitudes smaller data processing
that many prominent companies, governments, and military
throughput. In addition, human beings face unique challenges
departments must deploy a human-in-the-loop cyber defense
such as fatigue, anxiety and depression, which a computer
system. Typically, they usually set up a Security Operations
would never face. The data, coming from a variety of data
Center (SOC) to do 24*7 monitoring, intrusion detection, and
sources, are being continuously generated, and thus the volume
diagnosis (on what is actually happening). In a military set-
is overwhelming. The alert portion of the data contains a
ting, CNDSP (Computer Network Defense Service Provider)
large number of false positives. The analysts have to leverage
centers have already been established and operating for quite
their domain expertise and experience to make fast judgments
a few years. SOCs employ multiple automated measures
regarding which parts of the data are worthy of further
to detect cyberattacks, including traffic monitors, firewalls,
analysis. The time pressure is typically huge. The analysts have
vulnerability scanners, Intrusion Detection/Prevention System
to resort to day shift to night shift transitions to achieve 24*7
(IDS/IPS), alert correlation, and Security Information and
coverage. Due to these challenges, it’s usually very difficult
Event Management (SIEM). However, the most sophisticated
for analysts to generate incident reports in a SOC.
cyberattack strategies and malware are in fact beyond what
This paper aims to take the first step towards automating the
these automated attack-analysis measures can handle. Human
human-centric data triage through a new angle. The question
intelligence, provided by cybersecurity analysts, is playing a
to ask is whether analysts’ intelligence can be elicited from
critical role in SOCs in “comprehending” the sophisticated
their operations in performing data triage tasks, and whether
attack strategies through advanced correlated diagnosis.
the elicited intelligence can be leveraged to build automated
This work was supported by ARO W911NF-15-1-0576, ARO W911NF- system to ease the burden on analysts.
13-1-0421 (MURI), and NIETP CAE Cybersecurity Grant. Extensive research has been conducted to assist cybersecu-

978-1-5090-2403-2/16 $31.00 © 2016 IEEE 357

DOI 10.1109/BigDataSecurity-HPSC-IDS.2016.41

Authorized licensed use limited to: University of Obuda. Downloaded on March 28,2022 at 18:55:07 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. A scenario of data triage process demonstrates an analyst ﬁltering and correlating the network connection events based on characteristic constrains.

rity analysts. Alert correlation [2], is an important milestone of magnitudes smaller. (c) The automatic data triage system
in automated data triage. Alert correlation techniques use is evaluated through a human-in-the-loop case study. With
heuristic rules (a simple form of automaton) to correlate alerts. IRB approved, we’ve recruited 29 professional cybersecurity
However, it is limited by analyzing only one data source (i.e., analysts and captured their traces of operations to construct a
IDS alerts) while analysts must do cross-data-source analysis DT-SM. (d) We’ve evaluated the effectiveness of the DT-SM
in most cases. Alert correlation analysis has later on been construction and the data triage results based on false positive
integrated with other analysis [3], but the method is still along and false negative rates.
the heuristic rule angle. Motivated by the benefits of cross- II. C YBERSECURITY DATA T RIAGE
data-source analysis, SIEM systems (e.g., ArcSight [4]) have
A. A Data Triage Scenario
been focusing on security event correlations across multiple
data sources. Although SIEM systems take a big leap forward Data triage is the most fundamental step in cyber defense
in generating more powerful data triage automatons, SIEM operations [1]. Data triage analysis usually involves examining
systems require expert analysts to spend dedicated manual the details of alerts and various reports, filtering the data
effort on developing the data triage automatons [21], in order sources of interest for further in-depth analysis, and correlating
to generate a large number of complicated rules (complicated the relevant data. Fig. 1 shows a scenario of data triage process,
data triage automatons). Therefore, there is a need to make in which analysts perform data triage operations on a set of
SIEM systems more economically sustainable. given data sources to discover attack paths.
In this example (Fig. 1), two attackers attack a network via
This new human-centric angle sets our work apart from
different attack paths. An attack path is a series of steps and
the existing works on generating data triage automatons. Our
each step is performed on the basis of previous steps with the
approach is enabled by three insights. Firstly, it is actually
goal of reaching to a final target. The data sources collected
possibly to do non-intrusive tracing of human analysts’ con-
by the sensors reporting the details of network connection
crete data triage operations. Secondly, though challenging, the
can be generalized to the network connection events (will
operation traces could be analyzed in a largely automated
be defined in detail below). A network connection event can
way. Thirdly, though challenging, the result of analyzing the
be either a malicious event belonging to an attack path or a
traces could be automatically transformed into a state machine,
normal event. An analyst performs data triage operations to
which is essentially a data triage automaton. Accordingly,
filter the network connection events to make fast decisions
our approach works as follows. First, an analyst’s operation
about which connection events could be related to an attack
trace can be collected by an existing cognitive tracing toolkit
path and worth further investigation. Each analyst’s data triage
while he/she is performing a data triage task. Second, we
operation filters the data based on a condition on the network
use a graph to represent the analyst’s data triage operations
connection events. These are critical components in a data
and their temporal and logical relationships. Third, we mine
triage system and are explained as follows.
the constructed graph for patterns. Each pattern defines a
class of network connection events belonging to a multistage B. Network Connection Event
attack. Fourth, a data triage state machine (DT-SM) can be In this paper, the attack paths are considered at the abstrac-
constructed based on the patterns for automating data triage. tion level of network connection event. A network connection
The main contributions are as follows. (a) To the best of event (written in short as “connection event” ) is defined by a
our knowledge, this is the first method that generates data tuple with 8 attributes,
triage automatons directly from human analysts’ operation e =< time, conn, ipsrc , portsrc , ipdst , portdst , prot, msg >,
traces in an automatic manner. (b) Compared to the existing
data triage automaton generation methods, our method may where time is the time when the connection activity is per-
make the cost (of data triage automaton generation) orders formed, conn is the type of connection activity (a connection

358

Authorized licensed use limited to: University of Obuda. Downloaded on March 28,2022 at 18:55:07 UTC from IEEE Xplore. Restrictions apply.
can be “Built”, “Teardown”, or “Deny”), ipsrc , portsrc and III. T HE AUTOMATIC DATA T RIAGE A PPROACH
ipdst , portdst are the IP address and port of the source and To capture and leverage human intelligence in data triage, our
destination respectively, prot is the connection protocol type, approach includes four steps. (1) We capture the traces of analysts’
and msg is the message associated with the connection. A operations in performing data triage tasks. Each data triage operation
filters the data based on a characteristic constraint that specifies
successful TCP communication session usually is created by a subset of connection events. (2) A CC-Graph is constructed to
a “Built” connection event and ended with a “Teardown” represent each analyst’s data triage operations and their logical and
connection event. “Deny” refers to a failed connection attempt. temporal relationships. (3) We mine the CC-Graphs to construct the
An attack path is represented as a finite sequence of connection attack path patterns, each of which specifies a class of connection
events (e1 , . . . , en ). events as the traces of a multistage attack. (4) We use these attack
path patterns to construct a state machine for automated data triage.

C. Analysts’ Data Triage Operation A. Step 1: Identify Data Triage Operations

A toolkit, named ARSCA, has been developed for non-intrusively
Analysts conduct operations to filter suspicious network tracing an analyst’s operations in cyberattack analysis [8]. An analyst
connection events in the data sources. According to the study can conduct FILTER, SEARCH, and SELECT operations when
on analysts’ daily operations in data triage [1], these operations he/she is performing a data triage task in the environment of ARSCA.
ARSCA has been tested in an experiment with real analysts [7].
can be performed in three ways: to filter based on a condition,
Without reinventing the wheels, we direct adopt ARSCA in this
to search using a keyword and to directly select a portion work to capture the traces of analysts’ data triage operations. Once
of connection events with common characteristics. All these the analyst finishes his/her data analysis task, ARSCA outputs a
operations are named “Data Triage Operation”. The three types log file, in which the items in the log are in the structural for-
are explained as follows. mat: <Item Timestamp=[TIMESTAMP]> [ACTION_TYPE]
([CONTENT]) </item>. It contains the detailed description of
• FILTER(Din , Dout , C): Filter a set of connection events an operation and a time stamp that indicates when the operation
(Din ) based on a condition (C) on the attribute values of was performed. TABLE I shows the items in a ARSCA log that
the connection events and result in a subset (Dout ). corresponds to the data triage operations.
• SEARCH(Din , Dout , C): Search a keyword (C) in a set B. Step 2: Construct Characteristic Constraint Graph
of connection events (Din ). It results in a subset (Dout ). Analysts perform each data triage operation based on a certain
• SELECT(Din , Dout , C),Dj ⊆ Di : Select a subset of characteristic constraint (CC). The logic between the data triage
connection events (Dout ) in a set of connection events operations illustrates the analyst’s underlying data triage strategy.
(Din ), the subset of connection events (Dout ) has a 1) Relationship between Characteristic Constraints: Let
O1 = (t1 , fC1 (D)) and O2 = (t2 , fC2 (D)) be two different data
characteristic (C).
triage operations performed on a set of connection events D, we
1) Characteristic Constraint: Each data triage operation define three types of logical relationships between them based on
filters out a subset of connection events from the original set of their characteristic constraints: “is-equal-to”, “is-subsumed-by” and
“is-complementary-with”.
connection events specified in the provided data sources. The
characteristics of the subset connection events are determined isEql(O1 , O2 ) ⇐⇒ C1 ↔ C2 ,
by the constraint defined in the data triage operation. An isSub(O1 , O2 ) ⇐⇒ C1 → C2 ,
atomic constraint specifies the value range of an attribute of
the connection event, isCom(O1 , O2 ) ⇐⇒ C1 → ¬C2 and C2 → ¬C1 ,
where isEql(·, ·) and isCom(·, ·) are bidirectional relationships but
Ti = rv (attr, val), isSub(·, ·) is a unidirectional relationship.
2) Characteristic Constraint Graph (CC-Graph): Com-
where rv is the relationship between values, rv = {=, <>, > bining the logical relationships with the temporal relationships,
, <, <=, >=}. an analyst’s triage process can be modeled as a directed graph,
A constraint on the subset characteristics can be multidimensional named “Characteristic Constraint Graph (CC-Graph)”, G(trace) =<
because a connection event has multiple attributes. It’s represented V, Rl >, V = {O1 , . . . , On }, Rl ⊆ V ∗ V, l ∈ {isEql t , isSub t
by a predicate in disjunctive normal form, named “Characteristic , isCom t }. The nodes of the graph are the data triage operations.
Constraint (CC)”, A directed edge between two nodes represents a conjunction of a
constraint-related relationship and a “happen-after” relationship (i.e.
isEql t , isSub t , isCom t ), and the label of the edge indicates
C= (T1 ∧ . . . ∧ Tn ),
the types of relationships.
The reason why edges of CC-Graph are defined with both the
where Ti (1 ≤ i ≤ n) is an atomic predict. Let C be a characteristic temporal and logical relationship is because we are especially inter-
constraint specified in a data triage operation, and let D be a set of ested in how an operation is related to the operations that happened
connection events reported in the data sources. A data triage operation before it. If an analyst found the subset noteworthy, he/she may
is represented by a 2-tuple, generate several hypotheses about the possible attack path based on
the observation. To further investigate these hypotheses, the analyst
O1 = (t, fC ), may perform further data triage operation to narrow down the search
space. Therefore, the sequence of data triage operations may imply
where t is the timestamp of the analyst performing this operation, the changes of the analyst’s focus of attention.
fC : P (D) → P (D), such that fC (D) = {ek |ek ∈ D, C holds on As an example, Fig. 2 is a partial CC-Graph, which is constructed
ek }, and P (D) is the power set of D. based on an analyst’s trace. The complete graph includes 32 nodes

359

Authorized licensed use limited to: University of Obuda. Downloaded on March 28,2022 at 18:55:07 UTC from IEEE Xplore. Restrictions apply.
TABLE I and 261 edges. Node 8 is a data triage operation using the constraint
A N EXAMPLE OF ARSCA LOG “DstPort=6667”, which resulted in a subset of data which were iden-
tified by the analyst as suspicious events (represented in red color).
Oper- After that, the analyst then screened out the connection events with
Examples in ARSCA Log Description
ation destination port 6667 by conducting another data triage operation
<Item Timestamp=07/31 (i.e. Node 9). It may indicate that the analyst switched his attention
13:01:41> Filter the a set to the unexplored connection events with other characteristics after
FILTER( of connection investigating the connection events via destination port 6667. The
SELECT * FROM events (i.e.IDS following data triage operations of Node 13 and Node 14 have an
FILTER
IDS Alerts Alerts) based on “is-subsumed-by” relationship with the data triage operation of Node
WHERE a condition (i.e. 9. It means the analyst gradually screened out the connection events
SrcPort = 6667 SrcPort=6667) to narrow down the scope. The data triage operation of Node 14 is
</Item> the end of this narrowing process and it lets the analyst generate
<Item Timestamp=07/31 another hypothesis. Similarly, the sequence of Node 9, Node 15 and
13:20:29> Node 16 indicates the same strategy used by the analyst.
SELECT (
FIREWALL-[4/5/2012 C. Step 3: Mine the Characteristic Constraint Graphs
10:15 PM]- A CC-Graph contains rich information about the strategy used by
[Built]-[TCP] an analyst in data triage, which can answer the following questions.
(172.23.233.57/3484, (1) Given a set of data triage operations, which one is critical
10.32.5.58/6667), for the analyst to perform effective triage? (2) Given the critical
FIREWALL-[4/5/2012 data triage operations, what sequence these operations should be
10:15 PM]- performed to detect the events of an attack path? Given a CC-Graph
Filter the a set
[Teardown]-[TCP]
Select a set of G =< V, Rl >, two steps are taken to elicit such key characteristic
(172.23.233.52/5694, constraints from the CC-Graphs.
connection events
SELECT
10.32.5.59/6667),
(i.e. the underl- 1) Extract the Critical Endpoints in “isSub” Subgraphs:
FIREWALL-[4/5/2012 Given data triage operations O1 and O2 , isSub t (O2 ,O1 ) indicates
+ ined firewall log
10:15 PM]- that O2 is performed after O1 , and O2 further narrows the O1 ’s
LINK entries) with a
[Built]-[TCP] connection event set by using a more strict characteristic constraint.
common charact-
(172.23.233.57/3484, As the case described in Fig. 2, analysts may conduct a series of data
eristics (i.e.
10.32.5.58/6667), triage operations with the “is-subsumed-by” (isSub) relationships to
DstPort=6667)
FIREWALL-[4/5/2012 gradually narrow down the connection events. The endpoints of such
10:15 PM]- processes can be viewed as the critical characteristic constraints that
[Teardown]-[TCP] represent a noteworthy set of connection events. To investigate the
(172.23.233.58/3231, endpoints, we consider the subgraph that only consists of the edges
10.32.5.51/6667), of the relationship isSub t and isEql t (isEql is a special case
</Item> of isSub) and name it “isSub” subgraph.
<Item Timestamp=07/31 2) Mutually Exclusive Characteristic Constraints: It’s de-
13:20:43> sirable to have the endpoints of an “isSub” subgraph be mutually
LINK (Same DstPort) exclusive. To ensure the nodes to be mutually exclusive, we examine
</Item> the logical relationships between two nodes, and add additional
<Item Timestamp=08/09 characteristic constraints to one of them as follows. Let Vends
Search a keyword
11:08:01> be the set of endpoints in an “isSub” subgraph, several steps are
in a set of
SEARCH( taken to make sure every two endpoints in this subgraph have
SEARCH connection events
Firewall Log, an “isCom” relationship. (1) If two endpoints have an “isEql”
specified in the
172.23.233.52) relationship, we drop one from Vends . (2) Given Vends , we con-
Firewall log.
</Item> sider the subgraph induced by Vends from the original CC-Graph,
Ginduced =< Vends , {Rlogic } >, where Rlogic = {< vi , vj >
|isEql(vi , vj ) or isSub(vi , vj ) orisCom(vi , vj )}. Only consider-
ing the edges of “isCom” relationships in Ginduced , we apply the
Bron-Kerbosch algorithm [9] to find all the maximum cliques in
Ginduced . A maximum clique is a subset of the nodes, represented
by GC ∈ Vinduced , such that ∀vi , vj ∈ GC , isCom(vi , vj ), and
vk ∈ Vinduced , vk ∈ / GC , and ∀vi ∈ GC , isCom(vk , vi ). The
largest maximum clique is denoted as Vclique . (3) We add each
v ∈ Vends , v ∈ / Vclique into Vclique by removing the overlap between
v and vj ∈ Vclique . For vi ∈ / Vclique , it has an overlap with a
vj ∈ / Vclique , iff there exists a connection event e satisfies both Ci
and Cj . To remove the overlap, we adjust the characteristic constraint
used in vi (say Ci ) to Ci ∧ ¬Cj . If Ci ∧ ¬Cj is not false, we add
Fig. 2. A partial characteristic constraint graph constructed from an analyst’s the adjusted vi to Vclique (Vclique ∈ GC ).
trace. The nodes are the data triage operations. The node text is the index of
their temporal order. The node color refers to whether the data triage operation D. Step 4: State Machine Based on Attack Path Pattern
results in an observation of malicious connection events (red refers to yes, 1) Attack Path Pattern: Given the critical characteristic con-
while blue refers to no). The blue edges are the “isSub” relationships, while strains mined from the CC-Graphs (i.e. the nodes in Vclique ), attack
orange edges are the “isCom” relationships.
path patterns can be constructed based on the critical nodes and
their temporal orders. An attack path pattern is a sequence of

360

Authorized licensed use limited to: University of Obuda. Downloaded on March 28,2022 at 18:55:07 UTC from IEEE Xplore. Restrictions apply.
e3 as a refined instance, which indicating an attack path involving
the pair of hosts (“internal.03” and “external.02”).
“Occurrence rate” is calculated for each pair of hosts that are
involved in the connection events (as source and destination) of the
instances. The value range of the occurrence rate is [0, 1]. It can be
calculated as follows. Given an instantiated characteristic constraint
sequence (C1 , . . . , Cm ), and a pair of hosts (ip1 , ip2 ), if there are
α sets of connection events, each of which satisfy a characteristic
Fig. 3. The connection events (in the boxes) that satisfy the characteristic constraint in (C1 , . . . , Cm ), and the pair (ip1 , ip2 ) appears in all the
constraints (represented by cycles) in an attack path pattern can be further α connection event sets, we say the occurrence rate is α/m.
correlated based on connections between IPs. (The IP addresses are coded as Based on the heuristic and the definition of occurrence rate, we
“internal.XX” or “external.XX”. correlate the attack path instances in DT-SM’s output by the IP
linkage. We set up a value of occurrence rate as the threshold. Any
characteristic constraints in temporal order, which specifies a class
pair of hosts whose occurrence rate is lower than the threshold will
of attack path instances. The temporal relationship is a “can-happen-
not be linked in the instances. If the threshold is too large, it will
before” relationship between two characteristic constraints, denoted
screen out too many instances and result in a small number of output
by Ct (·, ·). Let C1 and C2 be two pre-specified characteristic
instances. However, if it is too small, the post-processing will results
constraints, Ct (C1 , C2 ) refers to the analysts’ knowledge about the
in too many instances which may annoy analysts. Therefore, it’s
attack: it happened and could happen again that a set of connection
critical to decide an appropriate value of the threshold based on the
events that satisfy C1 occur before the other set of events involving
prior knowledge.
the same hosts that satisfy C2 in an attack path. For example, one
typical step in a botnet attack is the IRC communication between
internal workstation with the external C&C servers. The IRC com- IV. E VALUATION
munication can be followed by a data exfiltration using FTP protocol. We implemented the intelligent data triage system. To evaluate
Therefore, this attack can be discovered by the attack pattern path: it, we focus on these aspects: (1) the effectiveness of the DT-SM
“(SrcIP = external servers AND SrcPort = 6667) OR (DstIP = external construction, and (2) the performance of the DT-SM on data triage
servers AND DstPort = 6667)” and “SrcIP = internal workstation with a selected threshold of the occurrence rate.
AND DstPort = 21”.
The “can-happen-before” relationships of the nodes in an attack A. Experiment Dataset
pattern is identified based on the temporal order of the corresponding 1) ARSCA Logs Collected from a Lab Experiment: With the
connection events in the task performed by the analyst. Assume that IRB approved, we recruited 30 full-time professional cybersecurity
the connection events come in sequence over time E = (e1 , . . . , en ), analysts in an experiment. The analysts were asked to accomplish
we say that the sequence of connection events satisfy a sequence a cybersecurity data triage task using ARSCA. ARSCA audited the
of characteristic constraints (C1 , . . . , Cm ) defined in an attack path their operations while they were performing the task [7]. We decided
pattern GC iff ∀Ci (1 ≤ i ≤ m), ∃ep (1 ≤ p ≤ n) satisfies to keep 29 trace logs after manually reviewing them. In total, we
Ci → ∀Cj (1 ≤ j < i), ∃eq (1 ≤ q ≤ p) satisfies Cj . There- collected 29 ARSCA logs with 1104 items. The average log length
fore, given (C1 , . . . , Cm ) and E, the attack path instances detected is 31.17.
based on (C1 , . . . , Cm ) is a sequence of connection event sets, 2) Task Data Sources and Scenario: The cyberattack analysis
attack(C1 ,...,Cm ) (E) = {(E1 , . . . , Em )}, where Eik (1 ≤ i ≤ m) = task is designed based on the cyber situation awareness task in VAST
{eik |eik is a connection event from sourceik1 to destinationik2 Challenge 2012 with a setting of a bank’s network with approximately
that satisfies Ci }. 5000 machines. The VAST challenge provides participants with IDS
2) State Machine: A finite state machine is constructed to alerts and firewall log. VAST Challenge 2012 asked the participants
automate data triage, based on an attack path pattern, named “DT- to identify the noteworthy attack incidents happened in the 40 hours
SM”. A state transition is defined as δ : S × D → S. Given covered by the IDS alerts and firewall logs [5]. The entire data sources
the current state Si ∈ S and a new connection event ei , we have were composed of 23,595,817 firewall logs and 35,709 IDS alerts.
δ(Si , ei ) = Si+1 iff Ci ∈ Si , ei satisfies Ci , and ∃Cj ∈ Si+1 that The task scenario underlying the data sources is a multistage attack
ei satisfies Cj . Therefore, given the input of a sequence of connection path that begins with normal connection event, including regional
events in temporal order, a DT-SM takes one connection event at one headquarters computers communicating with internal headboards
time and examine whether this event triggers any state transition. financial sever and normal web browsing. The attack starts when
Once all the input connection events have been processed, the DT- an internal workstation was infected with a botnet due to an inserted
SM output the instances of the attack path pattern. Each instance USB. The botnet replicated itself to other hosts on the network, and
is a sequence of connection event sets that satisfy a characteristic meanwhile the bonnet communicated with several C&C servers. The
constraint sequence in the attack path pattern. botnet kept infecting additional computers. It attempted to exfiltrate
data using FTP connections but it failed. After that, the botnet
E. Step 5: Post-precessing successfully exfiltrated data using SSH connections. In the following
To enable analysts review and modify the result of data triage, hours, the majority of the botnet communication and data exfiltration
the output of the automated data triage system should be readable kept happening.
for analysts. Therefore, a post-processing is necessary to aggregate The original data sources provided in the VAST challenge are too
the DT-SM’s output. We leverage a heuristic to correlate the output large for human analysts to analyze during our limited task time.
instances based on the linkage of source IP and destination IP of the Therefore, only small portions of data were selected and provided to
connection events: the attacker or the infected machines are likely the participants to enable them to complete the task in the experimen-
to be involved in multiple steps in an attack path. Therefore, given tal time (60 minutes). The selected task data sources include 239 IDS
a characteristic constraints C1 , the IP addresses that appear in the alerts and 115,524 firewall logs, reporting the network connection
connection events that satisfy C1 is very likely to appear in the events that happened in a 10-minute time window (4/5 22:15-22:25).
connection events that satisfy the other characteristic constraint C2 The task data sources contain the evidence of three attack steps:
that Ct (C2 , C1 ). In the case shown in Fig. 3, we correlate e2 and (1) IRC communication between the internal workstations with a set
of external Command and Control (C&C) servers, (2) denied data

361

Authorized licensed use limited to: University of Obuda. Downloaded on March 28,2022 at 18:55:07 UTC from IEEE Xplore. Restrictions apply.
exfiltration attempts using FTP connections, and (3) successful data
exfiltration using SSH connections.
B. Effectiveness of DT-SM Construction
The effectiveness of DT-SM construction is determined by three
intermediate results: (1) the identifed data triage operations, (2) the
CC-Graph constructed, and (3) the attack path pattern constructed
from the CC-Graphs.
We first evaluated the accuracy of the automatic data triage
operation identification by comparing with the data triage operation
identified by human. We had two persons manually parse the ARSCA
logs, and the identified data triage operation serves as the ground
truth. Both persons are familiar with the task and the ARSCA toolkit. Fig. 4. The similarity value changes with different IP occurrence rate.
One person is the main identifer, and the other is the evaluator:
given an ARSCA log, the main coder first read throught the trace 1,810 IDS alerts and 599, 489 firewall log, without overlap with the
and specified the data triage operations in the log; the evaluator task data set.
then read the trace for the second time and proofreaded the data Although the VAST Challenge provides an attack description,
triage operations identifed by the main identifier; everytime when a it’s still not apparent regarding whether each alert/log entry in the
disagreement occured, both of them went back to the ARSCA log data sources is related to attack or not. To determine the ground
and tried to reach an agreement by discussion. Given the 29 ARSCA truth before evaluating the DT-SM’s data triage result, we manually
logs, 1181 log items were manually analyzed and 394 data triage processed the testing data sources.
operations were identified. 2) Selecting the Occurrence Rate for Post-processing: We
The system identified 358 data triage operations in total. Compar- need to choose an occurrence rate as the threshold in post-processing.
ing them with the ground truth, there are 332 data triage operations We select the data sources of another 10-minute time window from
correctly identified by the system. There are 62 operations are not the original VAST data set (4/5 18:05-18:15) for determining the
identified by the system. Therefore, the false positive rate is 0.073 occurrence rate We first identify the instances in the ground truth
and the false negative rate is 0.157. based on the given attack scenario, and then test different values of
Based on the automatically identified data triage operations, 29 the occurrence rate and choose the one that can result in the instances
CC-Graphs are constructed. The average number of nodes is 34.3. similar to the instances in ground truth. The similarity is calculated
The average number of edges is 83.6, including 12.6% isEql edges, by:
23.6% isSub edges, and 63.8% isCom edges. The dominant percent-
age of isCom edges shows that most data triage operations have a 1
mutually exclusive relationship. similarity = ∗
|(ip1, ip2)|
To evaluate the effectiveness of attack path pattern construction,
we constructed an attack path pattern for each CC-Graph. We had # of (ip1, ip2)in SM instances
to exclude 5 pieces of ARSCA log because less than 2 data triage # of (ip1, ip2)in GT instances
(ip1,ip2)inGT
operations were identified in these traces. In total, we have 32 attack
path patterns constructed, containing 81 characteristic constraints. Given the output of DT-SM running on the selected data sources,
The average number of the nodes in the attack path patterns is 2.781. we tested four values of the IP occurrence rate in post-processing:
We evaluated these attack path patterns by checking whether 0, 0.25, 0.5, 0.75 and 1. Fig. 4 shows the similarity values for each
the characteristic constraints in them are critical and mutually occurrence rate. The similarity is closest to 1 when we select 0.75
exclusive. To decide whether the characteristic constraints in an as the occurrence rate.
attack path schema are critical, we mapped them to the analysts’ 3) False Positives and False Negatives: The performance of
answer gathered in the post-task questionnaires about the most the DT-SM is measured by false positive and false negative rates. We
important observations. We found 79 out of the 89 characteristic use eSM to denote the network connection events in the instances
constraints in the 32 attack path patterns mentioned by the analysts outputted by the DT-SM and eT to denote the connection events
that lead to important observations (88.76%). As for the 10 included in the ground truth. We calculate the false positive and false
characteristic constraints not mentioned in analysts’ answer, we negative rate using the following formula:
found 6 of them also lead to hypotheses in analysts’ ARSCA
log. All the characteristic constrains in the attack path pattern are f alse positive = 1−
mutually exclusive. The most common characteristic constraints
(ip1,ip2) |{e |e
SM SM
= eT , eSM ∈ instance}|
included in the attack path patterns are ‘‘DSTPORT = 6667 η
(ip1,ip2) |{e |eSM ∈ instance}|
SM
AND PRIORITY = Info AND PROTOCOL = TCP AND instance(SM )
SERVICE = 6667_tcp’’, ‘‘SRCIP = 172.23.235.*
1
AND DSTIP = 10.32.5.* AND DSTPORT = 21 AND η=
OPERATION = Deny AND PRIORITY = Warning AND |attack path instance|
SERVICE = ftp’’, ‘‘SRCIP = 172.23.234.* AND
DSTIP = 10.32.5.* AND DSTPORT = 22’’.
f alse negative = 1−
C. Performance of DT-SM
(ip1,ip2) |{e |e
SM SM
= eT , eT ∈ instance}|
1) Testing Data: We want to evaluate the performance of the
2
DT-SM on the data sources that are much larger than the data sources instance(GT )
that were used for constructing them (the data sources analyzed by
the analysts when generating ARSCA logs). Considering the task where instance(SM ) refers to the attack path instances in the DT-
data sources selected from a 10-minute time window, we chose a SM output, instance(GT ) is the attack path instances in the ground
100-minute time window (4/6 18:16-19:56) to get the testing data
sources from the original VAST data. The testing data set contains

362

Authorized licensed use limited to: University of Obuda. Downloaded on March 28,2022 at 18:55:07 UTC from IEEE Xplore. Restrictions apply.
truth. {ip1, ip2} refers to an unordered pair of hosts involved in an R EFERENCES
connection event, either of which could be the source. [1] DAmico, Anita, and Kirsten Whitley. The real work of computer
We ran the DT-SM constructed from the operation traces of the network defense analysts. In VizSEC 2007, pp. 19-37. Springer Berlin
29 participates over the testing data (1,810 IDS alerts and 599,489 Heidelberg, 2008.
firewall log). It takes DT-SM less than 1 second to finish the [2] Sadoddin, Reza, and Ali Ghorbani. Alert correlation survey: framework
processing the data which usually takes analysts hours to process. and techniques. In Proceedings of the 2006 International Conference on
The false positive rate is 0.025 and the false negative rate is 0.272. Privacy, Security and Trust: Bridge the Gap Between PST Technologies
The false positive is satisfactory, but the false negative still needs and Business Services, p. 37. ACM, 2006.
to be improved. However, the low recall rate isn’t caused by the [3] Zhai, Yan, Peng Ning, and Jun Xu. Integrating IDS alert correlation and
limitation of the approach but the limitation of the experiment setting. OS-level dependency tracking. In Intelligence and Security Informatics,
pp. 272-284. Springer Berlin Heidelberg, 2006.
The traces of analysts’ data triage operations in analyzing 10-minute-
[4] E. S. M. ArcSight, ArcSight ESM white paper, Online, 2010.
time-window events cannot fully embody all the expertise needed [5] Cook, Kristin, Georges Grinstein, Mark Whiting, Michael Cooper, Paul
for analyzing 100-minute-time-window events. Therefore, the attack Havig, Kristen Liggett, Bohdan Nebesh, and Celeste Lyn Paul. VAST
path pattern constructed based on the data triage operation traces Challenge 2012: Visual analytics for big data. In Visual Analytics
can’t perfectly solve a broader problem (i.e., triage the events in Science and Technology (VAST), 2012 IEEE Conference on, pp. 251-
the 100-minute time window). In the real-world practice, analysts 255. IEEE, 2012.
keep analyzing the continuously incoming data sources. As long as [6] Cheung, Steven, Ulf Lindqvist, and Martin W. Fong. Modeling mul-
ARSCA is deployed, it can automatically generate new logs, and tistep cyber attacks for scenario recognition. In DARPA information
therefore the ARSCA logs can be continuously imported into our survivability conference and exposition, 2003. Proceedings, vol. 1, pp.
284-292. IEEE, 2003.
data triage system and update the attack path pattern accordingly.
[7] Zhong, Chen, John Yen, Peng Liu, Rob Erbacher, Renee Etoty, and
It can result in more comprehensive attack path patterns because a Christopher Garneau. An integrated computer-aided cognitive task anal-
sufficient set of endpoints in the “isSub” subgraph can be identified, ysis method for tracing cyber-attack analysis processes. In Proceedings
and thus may increase the recall rate the DT-SM. of the 2015 Symposium and Bootcamp on the Science of Security, p.
9. ACM, 2015.
V. RELATED WORK [8] Zhong, Chen, John Yen, Peng Liu, Rob Erbacher, Renee Etoty, and
Researchers have recognized the significant role of human analysts Christopher Garneau. ARSCA: a computer tool for tracing the cognitive
in cyber analysis. Several exciting field studies have focused on processes of cyber-attack analysis. In Cognitive Methods in Situation
understanding analysts’ analytical processes [1], [7], [11]. Our current Awareness and Decision Support (CogSIMA), 2015 IEEE International
Inter-Disciplinary Conference on, pp. 165-171. IEEE, 2015.
work is closely related to the ARSCA tool of Zhong et al. [8] because [9] D. R. Aloysius, Bron-Kerbosch Algorithm, 2012.
we directly borrow the operation traces collected by this tool (i.e., [10] Csardi, Gabor, and Tamas Nepusz. The igraph software package for
ARSCA) as the input, instead of reinventing the wheel. This work complex network research. InterJournal, Complex Systems 1695, no. 5
takes a big step forward by constructing semantic models based on (2006): 1-9.
the operation traces and eliciting useful information to automate the [11] Erbacher, Robert F., Deborah A. Frincke, Pak Chung Wong, Sarah
triage analysis. Moody, and Glenn Fink. A multi-phase network situational awareness
Prior works on alert correlation and SIEM are complimentary to cognitive task analysis. Information Visualization 9, no. 3 (2010): 204-
but distinct from our work. Regarding the works on alert correla- 219.
tion, a lot of methods have been proposed using a wide range of [12] Goodall, John R., Wayne G. Lutters, and Anita Komlodi. Developing
expertise for network intrusion detection. Information Technology Peo-
technologies. Sadoddin and Ghorbani [2] provided a good survey
ple 22, no. 2 (2009): 92-108.
that classifies the state-of-the-art alert correlation techniques, each [13] Chen, Po-Chun, Peng Liu, John Yen, and Tracy Mullen. Experience-
of which has pros and cons. It also pointed out that knowledge based cyber situation recognition using relaxable logic patterns. In
acquisition is critical for most alert correlation methods. For example, Cognitive Methods in Situation Awareness and Decision Support
the prerequisites and consequences of alerts need to be specified (CogSIMA), 2012 IEEE International Multi-Disciplinary Conference
for rule-based correlation (e.g., [14]–[16]; attack scenarios models on, pp. 243-250. IEEE, 2012.
(e.g., temporal logic formalism [17]) should be defined for scenario- [14] Ning, Peng, Yun Cui, and Douglas S. Reeves. Constructing attack
based correlation. The focus of our work is the automatic knowledge scenarios through correlation of intrusion alerts. In Proceedings of the
elicitation, with the goal of eliciting the attack path pattern from 9th ACM conference on Computer and communications security, pp.
245-254. ACM, 2002.
analysts’ operations in their previous analysis processes. The DT-SM
[15] Cuppens, Frdric, and Alexandre Miege. Alert correlation in a coop-
has three merits: (1) It eases analysts’ burden by applying the data erative intrusion detection framework. In Security and Privacy, 2002.
constraints they used previously to triage new data so that analysts Proceedings. 2002 IEEE Symposium on, pp. 202-215. IEEE, 2002.
can spend more effort on more detailed investigation or handling new [16] Templeton, Steven J., and Karl Levitt. A requires/provides model for
challenging problems, thus improving the overall quality of incident computer attacks. In Proceedings of the 2000 workshop on New security
reports in an SOC. (2) It’s by natural enterprice specific and network paradigms, pp. 31-38. ACM, 2001.
specific. One DT-SM constructed from one analyst’s operation trace [17] Morin, Benjamin, and Herv Debar. Correlation of intrusion symptoms:
could be found directly useful by another analyst. (3) The DT-SM an application of chronicles. In Recent Advances in Intrusion Detection,
can suit the highly dynamic cyber environment because it takes the pp. 94-112. Springer Berlin Heidelberg, 2003.
[18] Dain, Oliver, and Robert K. Cunningham. Fusing a heterogeneous alert
input of analysts’ operation traces. These traces can be collected
stream into scenarios. In Proceedings of the 2001 ACM workshop on
continuously as long as the capturing toolkit can be launched while Data Mining for Security Applications, vol. 13. 2001.
analysts are performing cybersecurity analysis tasks. [19] Smith, Reuben, Nathalie Japkowicz, Maxwell Dondo, and Peter Mason.
Using unsupervised learning for network alert correlation. In Advances
VI. CONCLUSION in Artificial Intelligence, pp. 308-319. Springer Berlin Heidelberg, 2008.
Aimed to ease cybersecurity analysts’ burden in tedious data triage [20] Sommer, Robin, and Vern Paxson. Outside the closed world: On using
analysis, we developed an automated system which can generate machine learning for network intrusion detection. In Security and
data triage automatons directly from cybersecurity analysts’ operation Privacy (SP), 2010 IEEE Symposium on, pp. 305-316. IEEE, 2010.
traces. We have designed and implemented the new system and [21] Granadillo, Gustavo Daniel Gonzalez. ”Optimization of cost-based
threat response for Security Information and Event Management (SIEM)
evaluated it through a human-in-the-loop case study. This work shows systems.” PhD diss., Institut National des Tlcommunications, 2013.
that it is feasible to elicit attack path patterns by modeling and mining
the traces of analysts’ data triage operations.