Book
1 Introduction 1
1.1 Why a Handbook? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Use Cases Intend to Solve Various Cybersecurity Challenges through A Unified
DL Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 How to Properly Use This Handbook? . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Organization of Rest of The Book . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
6 AI Detects PC Malware 56
6.1 The Security Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.2 Raw Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.4 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.5 Model Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.6 Remaining Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.7 Code and Data Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Chapter 1 Introduction
high volume of cyber data; (b) Do not see the big picture: stovepiped organizational cyber-operation
processes are harmful; (c) Mindsets: unrealistic mindsets about “what is trusted” (e.g., the mindset used
in defending against the SolarWinds APT) are harmful; (d) Human resources: the performance of junior
analysts is in general much worse than that of experts.
With the rise of DL, AI technologies provide exciting new opportunities (e.g., accurately detecting
and predicting cyber threats, improving agility, reducing costs, improving human analysts’ job perfor-
mance, and raising the level of automation) for addressing the challenges faced by security teams.
For example, (a) the Endpoint Detection & Response solution framework of BlackBerry is powered
by Cylance AI, one of the first Machine Learning (ML) models for cybersecurity; (b) Splunk security
analysts are training various DNNs to serve various purposes on a daily basis.
Observation 4. Even some already highly successful tools may benefit substantially from DL
and RL. For example, although fuzzing tools are nowadays a standard security testing tool in the
industry, their efficiency is still a concern, and the industry has substantial incentives to use AI to
further improve the fuzzing efficiency.
The second reason why we write a handbook is that the existing survey papers have limited
usefulness for engineers, security analysts, and students taking an “AI for Cybersecurity” course.
Partially due to the above observations, more and more survey papers (e.g., [3, 14]) are being published
to summarize the new research progress on applying DL and RL to solve cybersecurity challenges.
Although these survey papers provide a systematization of the newly developed understandings (and
knowledge) about why DL and RL could be applied to effectively solve particular cybersecurity
challenges, they do not provide a hands-on introduction to how to apply DL and RL to solve a
particular cybersecurity problem. As a result, these survey papers do not provide one with a hands-on
learning opportunity, and there is still a wide gap between survey papers written by researchers and
the hands-on learning/training needs of engineers, analysts, and students.
The third reason is that a more useful book for engineers, analysts and students should include
the following information:
This book should provide beginners with a “jump start” learning experience. By “jump start”,
we mean that a beginner does not need to have any experience in applying DL and RL to solve
a security problem.
This book should provide experienced engineers and security analysts with a reference book
of use cases. Rather than reading the book from beginning to end, this book is intended to be
consulted for information on specific matters (e.g., use cases, code snippets).
Instead of providing a lengthy discussion on the general principles, each chapter just describes
how a specific security problem gets solved, i.e., each chapter just focuses on one use case.
Each use case should be presented in a straightforward way so that the readers can easily navigate
to what they are most interested in, such as code snippets and field-by-field presentation of the
training data samples.
The technical details should be self-contained: there is no need to go elsewhere for additional
technical details.
The use cases should cover the entire machine learning pipeline, so that the learning experiences
of an engineer, an analyst, or a student can be as close as possible to what is going on in the real world.
Learning by doing: datasets and source code snippets should be provided to the readers so that
the learning experiences can be as hands-on as the readers wish.
Figure 1.1: A unified DL pipeline could be used to solve various cybersecurity challenges.
As described in [54], the main active learning issues include uncertainty sampling, diversity
sampling, combining uncertainty sampling and diversity sampling, and active transfer learning.
The data annotation step will result in the Training Set, which is a set of labeled units of raw
data. Note that the Training Set should be as balanced as possible.
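As a concrete illustration, one simple way to approximate a balanced Training Set is to downsample every class to the size of the rarest class. The sketch below is our own illustrative example (the function name `balance` and the (unit, label) pair format are assumptions, not part of the pipeline described above):

```python
from collections import defaultdict

def balance(labeled_units):
    """Downsample each class to the size of the rarest class.

    `labeled_units` is a list of (unit, label) pairs (illustrative format).
    """
    by_label = defaultdict(list)
    for unit, label in labeled_units:
        by_label[label].append(unit)
    # Size of the rarest class bounds every class after downsampling.
    n = min(len(units) for units in by_label.values())
    return [(unit, label) for label, units in by_label.items() for unit in units[:n]]
```

In practice, class weights or oversampling of the rare class may be preferable when the rare class is too small to discard majority-class data.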
Since the Raw Data usually cannot be directly used to train a DL model, the Data Processing
step of the Set A workflow will transform each member in the Training Set to a training data
sample. Feature Extraction and Data Structure Design are two dominant challenges in this step.
When a pretrained model is suitable for solving the corresponding cybersecurity problem,
the Model Training step becomes a Transfer Learning step in which the learning task
dedicated to solving the cybersecurity problem is treated as a downstream task.
When no pretrained model is suitable for solving the cybersecurity problem, the Model Training
step will first train a model from scratch using the initial set of training data samples. This
sub-step is called Batch Training. When new data samples arrive, Incremental Training will
usually be conducted, using the new data samples to enhance the existing model(s) (e.g., to make
them more accurate) without retraining from scratch.
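The contrast between Batch Training and Incremental Training can be sketched with a toy perceptron; this is a minimal stand-in for a DNN (the function names and the perceptron model itself are our own illustrative assumptions):

```python
def _update(w, samples, labels):
    """One pass of perceptron updates; w = [w_1, ..., w_d, bias]."""
    for x, y in zip(samples, labels):
        score = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if score > 0 else 0
        if pred != y:
            delta = y - pred
            w = [wi + delta * xi for wi, xi in zip(w, x)] + [w[-1] + delta]
    return w

def train_batch(samples, labels, epochs=10):
    """Batch Training: learn a model from scratch on the initial training samples."""
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        w = _update(w, samples, labels)
    return w

def train_incremental(w, new_samples, new_labels):
    """Incremental Training: refine existing weights with new samples, not from scratch."""
    return _update(w, new_samples, new_labels)

def predict(w, x):
    return 1 if w[-1] + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
```

The key point is that `train_incremental` starts from the already-learned weights, so only the new samples need to be processed.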
During the Model Training step, hyper-parameter tuning and model architecture design are two
major challenges to address.
Since the computing environment of Model Training is often quite different from that of the
Deployed Model, model deployment itself involves several challenges and issues, including how
to conduct online testing, how to compress a DNN model, how to employ appropriate hardware
acceleration, how to manage the dependencies between the libraries called during the execution
of the model, how to audit a DNN site, and how to reliably upgrade a DNN site.
A Set B workflow is quite straightforward. For example, when a newly arrived unit of raw data
needs to be classified, the unit will first be sent to the Data Processing component, which
conducts the same kind of data processing described above. Then the corresponding data
sample will be fed into the Deployed Model as an input.
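In code, a Set B workflow amounts to reusing the Set A data-processing function and then invoking the deployed model; the names below are illustrative placeholders, not APIs from any particular system:

```python
def classify_new_unit(raw_unit, process, deployed_model):
    """Set B workflow sketch: same Data Processing as Set A, then the Deployed Model."""
    sample = process(raw_unit)       # the same kind of data processing as in Set A
    return deployed_model(sample)    # feed the resulting data sample into the model
```

For example, with a trivial stand-in model, `classify_new_unit("GET /index.html", len, lambda s: "benign" if s < 100 else "suspicious")` returns a label for the new unit.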
Cyclic workflows, though not always required, could be helpful in some circumstances. For
example, when the audited classification/prediction results (of the Deployed Model) are further
analyzed, false positives and false negatives could be identified, and it is not surprising that
these data samples could make Incremental Learning substantially more effective. Hence, the
industry has incentives to add the corresponding units of raw data to the Training Set. As a
result, a cyclic workflow is formed.
Since the DL pipeline is dedicated to solving cybersecurity problems, it has a few domain-
specific characteristics. For example, the Data Structure Design challenge addressed in the Data
Processing step of a Set A workflow is often associated with in-depth domain knowledge which
only a cybersecurity expert would have. Insights on “which data structure is most appropriate
to represent the raw data” are often hard to obtain or learn automatically.
1.3 How to Properly Use This Handbook?
Chapter 2 AI Conducts Two Reverse Engineering Tasks
(a) A data sample of function boundary: an instruction sequence, its raw bytes, and per-byte labels;
label 1 indicates that the corresponding byte is the beginning of a function. (b) A data sample of
function signature: number of args: 2; args types: [char*, char*]; return type: char*.
Figure 2.1: Illustration of data samples.
These tools are developed based on domain knowledge, that is, code generation conventions, to
carry out the disassembly and the analysis. However, many factors can affect the generated binary
code, such as the programming language used, the compiler and its options, and different architectures
and operating systems. Furthermore, some commercial software and malware are obfuscated to hide
their logic against RE. Therefore, developing rule-based methods is not an easy task and requires a
huge amount of human effort.
In recent years, machine learning methods have been gradually introduced to tackle various
RE tasks (e.g., function boundary identification, function prototype inference) and have shown great
potential. Among these methods, deep learning has produced promising results. Researchers observed
two characteristics that make deep learning effective in RE: firstly, compared with traditional machine
learning, deep learning has strong representation learning ability and can discover intricate structures
in high-dimensional data [41, 61]. Secondly, even though deep learning generally requires a large
dataset to train a high-quality model, it is not a challenge for most RE problems to generate a large
amount of training samples.
2.3 DL Pipeline
In this section, we follow the unified DL pipeline (Figure 1.1) introduced in Chapter 1 to illustrate
the deep learning approaches for RE. Different RE tasks often share some common characteristics.
Therefore, in each of the following sub-sections, we will first present the common characteristics
shared by the two RE tasks, and then present the distinct characteristics of each task.
def instructions2int8(instructions):
    # Flatten the raw bytes of each instruction into one uint8 sequence.
    uint8 = []
    for ins in instructions:
        uint8.extend(ins.binary)
    return uint8
Figure 2.2: Binary encoding. For example, the bytes 83 ec 0c are encoded as the bit sequence
10000011 11101100 00001100.
def read_asmcode_corpus(asmcode_corpus):
    asmcode_corpus = inst2vec.refine_asmcode(asmcode_corpus)
    list_instruction = []
    for one_instruction in asmcode_corpus.split('\n'):
        one_instruction = one_instruction.replace(',', ' ')  # so operands split on whitespace
        list_instruction.append(one_instruction.split())
    return list_instruction

word_list = read_asmcode_corpus(asmcode_corpus)
model = Word2Vec(word_list, size=vectorsize, window=128, min_count=1, workers=4, iter=10)
disadvantage is that the semantic meaning of each instruction is hidden in the raw bytes encoding. For
example, the X86 Instruction Set Architecture (ISA) adopts variable-length encoding, which means
that common instructions typically have a shorter encoding so that they need less space in the instruction
cache, whereas less common instructions have a longer encoding. Therefore, the semantic meaning
represented by bytes differs across instructions, which poses a big challenge for deep neural
networks trying to understand the instructions. Consequently, even though the raw bytes contain as much
information as the disassembled code, it is questionable whether neural networks can effectively learn
the semantic meaning from such a raw byte encoding.
Encoding of Disassembled Instructions. Since assembly code carries the semantic meaning of instruc-
tions in a more comprehensible way, some researchers choose and encode features from assembly
code. For example, [45] adopts opcodes extracted from assembly code and encodes them through
one-hot encoding, but this method completely ignores the information from operands. Instruction2Vec,
proposed by [42], adopts both the opcode and operands of instructions, and encodes each instruction as
a nine-dimensional feature vector.
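The opcode-only one-hot scheme can be sketched as follows; the function name and toy vocabulary are our own illustrative assumptions (real vocabularies cover the whole ISA):

```python
def one_hot_opcodes(opcodes, vocab):
    """Encode each instruction by its opcode only; operand information is discarded."""
    index = {op: i for i, op in enumerate(vocab)}
    vectors = []
    for op in opcodes:
        v = [0] * len(vocab)     # one slot per opcode in the vocabulary
        v[index[op]] = 1         # mark the opcode of this instruction
        vectors.append(v)
    return vectors
```

For instance, with vocabulary ["mov", "add", "sub"], the sequence ["mov", "add"] becomes [[1, 0, 0], [0, 1, 0]]; the discarded-operand limitation described above is visible directly, since "add $0x1,%rdx" and "add $0x20,%rsp" get identical vectors.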
Representation Learning for Instructions. Another method is to learn an instruction embedding
through representation learning. For example, Asm2Vec [21] makes use of the PV-DM neural net-
work [40] to jointly learn instruction embeddings and function embeddings. Specifically, this model
learns latent lexical semantics between tokens and represents an assembly function as an internally
weighted mixture of collective semantics. The learning process does not require any prior knowledge
about assembly code, such as compiler optimization settings or the correct mapping between assembly
functions [21]. Experiments show that Asm2Vec is more resilient to code obfuscation and compiler
optimizations than the aforementioned two embedding methods.
Pretrained Model for Instruction Embedding. Although recent progress in instruction representation
learning is encouraging, there are still some challenges in learning a good representation. As
mentioned in [44], the complex internal formats of instructions are not easy to learn. For instance,
in x86 assembly code, the number of operands can vary from zero to three; an operand could be a
CPU register, an expression for a memory location, an immediate constant, or a string symbol; some
instructions even have implicit operands.
Secondly, some high-level semantics (e.g., data flow) hidden in an instruction sequence cannot be
easily captured by a deep learning model without specific designs. Therefore, pretrained models
(e.g., [44] and [38]) with specific pre-training tasks and a large dataset to reveal complex internal
structures and high-level semantics have been proposed. Such a well-trained model can be applied to
many downstream tasks and achieve good results.
2.4 Model Architecture
tokens1 = [self.vocab.sos_index]
segment1 = [1]
i = 1
for ins in bb_pair[0].split(";")[-5:]:
    if ins:
        for token in ins.split():
            tokens1.append(self.vocab.stoi.get(token, self.vocab.unk_index))
            segment1.append(i)
        tokens1.append(self.vocab.eos_index)
        segment1.append(i)
        i += 1
Deep learning is supposed to be good at building models that learn intricate structures and capture
complicated underlying features, even from high-dimensional data [41, 61]. However, it is better
if the developer can represent the data sample in a data structure that directly exposes its internal
formats and relationships to the deep learning model.
Accordingly, to capture the complex internal formats of instructions and the relationships between
the opcode and operands inside each instruction, it is worthwhile to adopt a fine-grained strategy to
decompose instructions, as mentioned by [44]. To achieve this goal, PalmTree [44] considers each
instruction in a code sequence as a sentence, decomposes it into basic tokens, and adopts the
Transformer [75] to learn an embedding for each sentence. The code for tokenization is shown in
Code 2.3, and an example of tokenized assembly code is shown on the left of Figure 2.3. Taking the
sequence of tokens as input, PalmTree adopts the Transformer to learn the embeddings for instructions.
In order to enable the designed deep neural network to understand the internal structures of
instructions and dependency among instructions, PalmTree makes use of three pre-training tasks as
shown in Figure 2.3:
1. A Masked Language Model (MLM) to understand the internal structures of instructions. MLM
predicts the masked tokens within instructions.
2. A Context Window Prediction (CWP) to capture the relationships between instructions. CWP
infers the word/instruction semantics by predicting two instructions’ co-occurrence within a
sliding window in control flow.
3. A Def-Use Prediction (DUP) task to learn the data dependency (or def-use relation) between
instructions. DUP predicts whether two instructions have a def-use relation.
Code 2.4 shows the code snippet for the pre-training tasks. Line 6, Line 3, and Line 2 calculate the
loss for MLM, CWP, and DUP, respectively. Line 9 aggregates the total loss.
The Transformer’s encoder produces a sequence of hidden states as output, and each hidden state
corresponds to a token from the input. PalmTree adopts the mean pooling of the hidden states
of the second-to-last layer as the instruction representation. However, other pooling methods,
such as max pooling and sum pooling, could achieve a similar purpose.
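The pooling options can be sketched in plain Python over per-token hidden-state vectors; this is a toy stand-in for the tensor operations a real implementation would use:

```python
def pool(hidden_states, method="mean"):
    """Pool a list of equal-length hidden-state vectors into one instruction embedding."""
    n, dim = len(hidden_states), len(hidden_states[0])
    if method == "max":
        return [max(h[d] for h in hidden_states) for d in range(dim)]
    if method == "sum":
        return [sum(h[d] for h in hidden_states) for d in range(dim)]
    # Default: mean pooling, as PalmTree applies to the second-to-last layer.
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]
```

For two token states [1.0, 2.0] and [3.0, 4.0], mean pooling yields [2.0, 3.0] while max pooling yields [3.0, 4.0]; the choice mainly affects how strongly outlier tokens dominate the instruction embedding.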
Note: The purpose of the pre-training tasks is to learn good representations for instructions
through unsupervised learning. Therefore, pre-training tasks are not always necessary if the labeled
training set of a downstream task is big enough to learn a good model.
2.5 Model Training Issues
Figure 2.4: Overview of the model for function signature generation: data sample → pretrained
model → instruction embedding → bi-directional LSTM → output layer → label.
def predict(self, inputs, source):
    # Generate an embedding for each byte of the instruction sequence.
    inputs_embedded = self.embedding_layer(inputs)
    # Generate a representation for each byte through a bi-directional RNN/LSTM.
    inputs_embedded = self.encoder_biRNN(inputs_embedded)
    # Predict a label for each byte based on the learned representation, via an output layer.
    class_pred = self.class_pred_net(inputs_embedded)
    return class_pred
variants (LSTM or GRU) are adopted to aggregate the instruction embeddings of the target function.
Thirdly, an output layer is adopted to predict the type signature for the target function. Code 2.5 shows
the implementation of the model.
2.6 Model Performance
2.9 Remaining Issues
Chapter 3 AI Detects Android Malware
Code 3.1: The notification with full intent and set as “call” category.
Details about a new Android ransomware variant were revealed on October 8, 2020 [68].
Basically, to release its ransom note, it utilizes the “call” notification and “onUserLeaveHint()”
jointly to trigger the ransom screen. As shown in Code 3.1, the malware creates a notification
builder. The setCategory method makes this notification require immediate user action. The
setFullScreenIntent method wires the notification to a GUI so that it pops up when the user clicks it.
To continuously occupy the screen with the ransom note, the malware needs to invoke the automatic
pop-up of the ransomware screen without user interaction. As shown in Code 3.2, the malware
overrides the onUserLeaveHint() callback function of the Activity class. Whenever the malware
screen is pushed to the background, the callback function will be called, bringing the in-call activity
to the foreground. Since the malware has already hooked the RansomActivity intent with the
notification type set to “call”, a chain of function calls has been formed for the malware to occupy
the screen.
3.4 Feature Engineering
Figure 3.1: Machine learning pipeline for Android malware detection (proposed in [33]): app
metadata is extracted, Feature Engineering selects a Key API Set for feature-embedded encoding,
the encoded Training Dataset feeds a Random Forest with cross-validation, and the Deployed Model
undergoes online testing on newly submitted apps, with FP and FN collection driving model retraining.
analysis engine built in [33] based on Google’s official Android emulator [4] and the Xposed hooking
framework [85]. The intended metadata includes requested permissions, invoked APIs, and used
intents. To achieve high UI coverage, the dynamic analysis engine adopts the Monkey UI exerciser [74]
to generate UI event streams at both the application and system levels. With the extracted metadata,
the Feature Engineering component is responsible for selecting the key API set for further feature-
embedded encoding. Afterwards, the encoded feature vectors will be used as the Training Dataset for
the Random Forest machine learning algorithm. After cross-validation, the trained model will be
deployed in the production environment for online testing. Meanwhile, false positives and false
negatives will be collected to prepare for model retraining. The periodic retraining interval is set to
one month in [33]. During model retraining, the newly submitted apps and the existing app dataset
will go through the whole procedure described above to obtain a retrained model that accommodates
the evolving Android malware.
public void sendTextMessage(String destinationAddress, // String: This value cannot be null.
    String scAddress,                // String: This value may be null.
    String text,                     // String: This value cannot be null.
    PendingIntent sentIntent,        // PendingIntent: This value may be null.
    PendingIntent deliveryIntent,    // PendingIntent: This value may be null.
    long messageId)                  // long: An id that uniquely identifies the message requested to
                                     // be sent. Used for logging and diagnostics purposes. The id may be 0.
Feature engineering is to identify features that better represent the target problem being handled
by machine learning and further improve its accuracy. It usually uses domain-specific knowledge
or automated methods to extract, select, or construct the right features. For this Android malware
detection use case, the key to achieving high accuracy is feature selection, which is essentially API
selection in the approach proposed in [33]. Given that the current Android SDK provides over 50,000
APIs, a main issue is whether all of the APIs should be selected. As demonstrated in [33], only a
small portion of the APIs need to be selected. (In fact, 426 key APIs are selected as features.) The
main reasons why only a small portion is needed are as follows.
Since this technique records invocation of APIs during the dynamic emulation of the tested app,
the dynamic analysis time will be substantially impacted by the number of selected APIs.
A better detection accuracy can be achieved by selectively tracking a smaller portion of APIs
than by tracking all 50,000 of them.
Some APIs complement each other with regard to functionality, so they can be combined
to enhance the accuracy.
3.5 Training Data
Select APIs with the highest correlation with malware (Set-C), which results in the top 260 highly
correlated APIs.
Select APIs that need restrictive permissions (Set-P), which has 112 APIs. To ensure the
privacy/security requirements of user information, permissions to access specific information or
execute particular functions need to be granted to an app [83]. Two tools, Axplorer [20]
and PScout [22], can be utilized to do the selection.
Select APIs that perform sensitive operations (Set-S), which includes 70 identified APIs. The
selection is based on domain knowledge, considering the sensitive operations previously utilized to
realize attacks [13, 26, 90, 95].
Combining the above leads to a total of 426 key APIs, i.e., Set-P ∪ Set-S ∪ Set-C, as shown
in Figure 3.2.
Figure 3.2: Number of APIs in Set-C, Set-P, Set-S and their overlaps [33].
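The combination of the three sets is a plain set union over API names; the toy API sets below are hypothetical stand-ins (the real Set-C, Set-P, and Set-S hold 260, 112, and 70 APIs, and their overlaps explain why the union has 426 rather than 442 members):

```python
# Hypothetical API-name sets standing in for the real Set-C, Set-P, and Set-S.
set_c = {"sendTextMessage", "getDeviceId", "exec"}          # highest malware correlation
set_p = {"sendTextMessage", "getDeviceId"}                  # need restrictive permissions
set_s = {"exec", "loadLibrary"}                             # perform sensitive operations

# Set-P ∪ Set-S ∪ Set-C: overlapping APIs are counted once.
key_apis = set_p | set_s | set_c
```

Here the three sets hold 3 + 2 + 2 = 7 entries but the union has only 4, mirroring (in miniature) how 442 selections shrink to 426 key APIs.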
3.6 Machine Learning
The Random Forest algorithm is adopted to classify an app as malware or non-malware. During
the dynamic emulation of each app, the invocation status of the tracked APIs (API call names and
parameters) is logged. As shown in Code 3.4, each emulation is assigned a unique task id.
For example, task id 20170718000003300 corresponds to an app named “meimeicaicaicai”. One-hot
encoding [33] is employed to convert the log into a feature vector comprising a total of n bits, where n
is the total number of tracked APIs. As shown in Code 3.5, each bit corresponds to a tracked API: if
the API is invoked, the corresponding bit is set to 1; otherwise it is 0.
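The encoding step just described can be sketched as follows; the function and variable names are our own, not from the implementation in [33]:

```python
def encode_invocations(tracked_apis, invoked_apis):
    """One-hot style encoding: bit i is 1 iff the i-th tracked API was invoked."""
    invoked = set(invoked_apis)
    return [1 if api in invoked else 0 for api in tracked_apis]
```

With n tracked APIs, every emulation log collapses to an n-bit vector regardless of how many times, or in what order, each API was called.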
1 -1 0 ProcessId{:}2871{|}PackageURI{:}file:///data/share1/task/com.tle.erfpoz.apk{|}FileMd5{:}45
e6ef0e6b6fa2dc122ee6d18e29fca1{|}FileName{:}/data/share1/task/com.tle.erfpoz.apk{|}ProcessName{:}
app_process{|}OperatorProcessName{:}system_server 329 InstallPackage 5441 5441 0 20170718000003300
2 -1 0 Priority{:}1000{|}PackageName{:}com.tle.erfpoz{|}Actions{:}android.net.conn.CONNECTIVITY_CHANGE;
android.intent.action.ACTION_POWER_CONNECTED;android.intent.action.DATA_CHANGED;android.intent.action
.USER_PRESENT{|}ClassName{:}com.pz.test.TestReceive{|}OperatorProcessName{:}system_server 329
DefineReceiver 5443 5443 1 20170718000003300
3 -1 0 Priority{:}1000{|}PackageName{:}com.tle.erfpoz{|}Actions{:}com.diamondsks.jaaakfd.com.mo.action.
ACTION;android.intent.action.USER_PRESENT{|}ClassName{:}com.ast.sdk.receiver.ReceiverM{|}
OperatorProcessName{:}system_server 329 DefineReceiver 5443 5443 2 20170718000003300
4 -1 0 Priority{:}2147483647{|}PackageName{:}com.tle.erfpoz{|}Actions{:}android.provider.Telephony.
SMS_RECEIVED{|}ClassName{:}com.mj.jar.pay.InSmsReceiver{|}OperatorProcessName{:}system_server 329
DefineReceiver 5443 5443 3 20170718000003300
5 -1 0 Priority{:}2147483647{|}PackageName{:}com.tle.erfpoz{|}Actions{:}android.provider.Telephony.
SMS_RECEIVED;SEND_SMS_ACTION1;SEND_SMS_ACTION2;GET_SMS_ACTION;android.intent.action.USER_PRESENT{|}
ClassName{:}com.android.mtools.MyReceiver{|}OperatorProcessName{:}system_server 329 DefineReceiver
5443 5443 4 20170718000003300
6 -1 0 Priority{:}0{|}PackageName{:}com.tle.erfpoz{|}Actions{:}android.net.conn.CONNECTIVITY_CHANGE;
android.intent.action.USER_PRESENT;android.intent.action.BOOT_COMPLETED{|}ClassName{:}com.b.ht.JDR{|}
OperatorProcessName{:}system_server 329 DefineReceiver 5443 5443 5 20170718000003300
7 -1 0 Priority{:}2147483647{|}PackageName{:}com.tle.erfpoz{|}Actions{:}android.provider.Telephony.
SMS_RECEIVED;android.net.conn.CONNECTIVITY_CHANGE;android.intent.action.BATTERY_CHANGED;android.
intent.action.SIM_STATE_CHANGED;android.intent.action.NOTIFICATION_ADD;android.intent.action.
SERVICE_STATE;android.intent.action.NOTIFICATION_REMOVE;android.intent.action.NOTIFICATION_UPDATE;
android.bluetooth.adapter.action.STATE_CHANGED;android.intent.action.ANY_DATA_STATE;android.net.wifi.
STATE_CHANGE;android.intent.action.BOOT_COMPLETED;android.intent.action.SCREEN_ON;android.intent.
action.USER_PRESENT{|}ClassName{:}com.door.pay.sdk.sms.SmsReceiver{|}OperatorProcessName{:}
system_server 329 DefineReceiver 5443 5443 6 20170718000003300
8 -1 0 Priority{:}2147483647{|}PackageName{:}com.tle.erfpoz{|}Actions{:}android.intent.action.
BOOT_COMPLETED;android.intent.action.USER_PRESENT{|}ClassName{:}com.emag.yapz.receiver.BootReceiver
{|}OperatorProcessName{:}system_server 329 DefineReceiver 5443 5443 7 20170718000003300
9 -1 0 Priority{:}2147483647{|}PackageName{:}com.tle.erfpoz{|}Actions{:}android.provider.Telephony.
SMS_RECEIVED;android.provider.Telephony.SMS_DELIVER{|}ClassName{:}com.emag.yapz.receiver.
ZPayReceiver2{|}OperatorProcessName{:}system_server 329 DefineReceiver 5443 5443 8 20170718000003300
3.7 Model Deployment
1 [[0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0,
1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
0, 0, 0, 1, 0, 0],
2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0]]
output branch according to its value until it reaches a leaf node, and the classification result stored
in the leaf node is used as the decision result. The random forest algorithm is essentially an ensemble
method, which can effectively reduce over-fitting. It basically samples several small sets from the
original training dataset and then trains a model on each small set, taking the average (regression)
or majority vote (classification) of all model outputs.
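The sampling-and-voting idea can be sketched independently of any particular tree implementation; the two helpers below are illustrative assumptions, not the actual Random Forest code used in [33]:

```python
import random

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement: one small set for one tree."""
    return [rng.choice(dataset) for _ in range(len(dataset))]

def majority_vote(predictions):
    """Aggregate the classifications of the individual models by majority vote."""
    return max(set(predictions), key=predictions.count)
```

Because each tree sees a different bootstrap sample (and, in a full Random Forest, a random feature subset), individual over-fitted trees disagree on noise but agree on signal, which is why the vote reduces over-fitting.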
Code 3.6 shows a code snippet used to implement the above-described Random Forest machine
learning algorithm. In the experiments conducted in [33], the hyperparameters of the model are
configured based on domain knowledge; they are included in the repository created by the authors
of [33], which is mentioned in Section 3.9.1.
3.8 System Evolution
Chapter 4 AI Detects Abnormal Events in Sequential Data
4.2 Dataset
We introduce two datasets, UNSW-NB15 [56] and SOSP2009 [87].
1 https://research.unsw.edu.au/projects/unsw-nb15-dataset
2 https://research.unsw.edu.au/projects/unsw-nb15-dataset
publicly available to help researchers develop data-driven attack detection techniques. Table 4.1 shows
the number of samples in each class and their percentages.
4.3 Data Processing
Table 4.3: Some normal and abnormal block IDs in the HDFS dataset.
BlockId Label BlockId Label
blk_-3544583377289625738 Anomaly blk_-50273257731426871 Normal
blk_-9073992586687739851 Normal blk_4394112519745907149 Normal
blk_7854771516489510256 Normal blk_3640100967125688321 Normal
blk_1717858812220360316 Normal blk_-40115644493265216 Normal
blk_-2519617320378473615 Normal blk_-8531310335568756456 Anomaly
In the data processing phase, we first re-crafted the log keys into three key sets: K0 (new base),
K1, and K2. In total, they hold 31, 101, and 304 log keys, respectively. Since labels are given to sessions
and block IDs, it is safe to use log keys to encode each event (i.e., log entry) without modifying the
original event sequences. The statistics of each key set under the configuration seqlen = 10 are shown in
Table 4.4. These key sets are from the same log, except that K1 and K2 discard less information by
reattaching add-on strings; for example:
ki ∈ K0: "Received block"
1st add-on: "of size 20-30MB"
2nd add-on: "from 10.250.*"                                  (4.1)
kj ∈ K1: "Received block of size 20-30MB from 10.250.*"
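Re-attaching add-on strings, as in (4.1), can be sketched as a simple string join; the helper name is our own illustrative choice:

```python
def make_log_key(base, addons):
    """Build a K1/K2-style log key by re-attaching add-on strings to a K0 base key."""
    return " ".join([base] + addons)
```

Each extra add-on multiplies the number of distinct keys, which is why K1 and K2 grow to 101 and 304 keys from the 31 base keys in K0.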
Regarding the split of the training and testing datasets: the training dataset for K0 and K1 consists
of 200,000 normal sessions, and the training dataset for K2 consists of 100,000 normal sessions.
The number of sessions in the dataset for K2 is smaller due to computational limitations in our
experiment environment. The testing datasets for K0, K1, and K2 each include all 16,868 abnormal
sessions. Besides the abnormal sessions, the testing datasets for K0 and K1 include 200,000 normal
sessions, and the testing dataset for K2 includes 100,000 normal sessions.
The data processing code snippet is shown in Code 4.3. It extracts the source IP, source port,
destination IP, and destination port, and then combines them into an event which is loaded into a
sequence. Labels and log keys are also indispensable. For log keys, some occurrences with
values of "-" and spaces are removed, and some text fields that contain numeric values are converted
to number types. Each feature column's median is used to replace null values. Finally, we attach some
add-on keys to form a log key sequence (more details are in "data/unswnb15/key.py").
# define how to read sequences from file
def readSequences(ip, filename):
    sequence = {}
    label = {}
    with open(os.path.join(args.input, ip, filename), 'rt') as fin:
        csvfin = csv.reader(fin, delimiter=',')
        for line in csvfin:
            datetime = data.unswnb15.key.getDateTimeFromLine(line)
            srcip = line[data.unswnb15.key.srcip]
            dstip = line[data.unswnb15.key.dstip]
            dstport = line[data.unswnb15.key.dsport]

            subject = '-'.join(['from', srcip, 'to', dstip, 'on', str(datetime.day),
                                str(datetime.hour), str(datetime.minute // args.window_size)])
            slabel = data.unswnb15.key.getLabelFromLine(line)
4.4 Model Architecture
Instead of utilizing existing embedding approaches to generate word embedding vectors, we train
an additional embedding layer along with the other layers, since the embedding function can be
customized in this way (shown in Code 4.5, which is implemented by the Dablog class in file
"models/dablog.py").
self.label_input = keras.layers.Input(shape=(None,), name='Label_Input')
self.label_embed = keras.layers.Embedding(self.n_labels, self.config.hidden_unit, mask_zero=True, name='Label_Embed')
self.label_dense = self.label_embed(self.label_input)
To handle time-sensitive events, the objective function of this autoencoder also follows that of a
typical deep autoencoder. Let X be the input distribution and Y = ψ ◦ ϕ(X) the target (reconstruction)
distribution; the loss function can then be written as:
ϕ, ψ = arg min_{ϕ,ψ} ∥X − Y∥²
The reason we use the rev function is that Y is in reverse order from X̃, due to the LSTM's
hidden state. As illustrated in Figure 4.3, we can represent the LSTM network simply by:
ht = LSTM (xt , ht−1 ) (4.3)
where ht is the hidden state at time step t and xt is the current data point. The LSTM iteratively
evaluates this function from step 1 to the last time step. Similar to a function call stack,
we can think of the encoding procedure as pushing xt onto a stack hT, and the decoding procedure as popping xt
off the stack hT. Therefore, X̃ and Y are in reverse order.
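The stack analogy above can be illustrated in a few lines of Python. This is a toy sketch, not the model code: encoding "pushes" each x_t, decoding "pops" them back, so the reconstruction comes out reversed.

```python
# Toy illustration of the encode/decode stack analogy: pushing x_1..x_T and
# then popping yields the sequence in reverse order.
def encode(xs):
    stack = []
    for x in xs:          # "push" x_1 ... x_T
        stack.append(x)
    return stack

def decode(stack):
    ys = []
    while stack:          # "pop" yields x_T ... x_1
        ys.append(stack.pop())
    return ys

X = [1, 2, 3, 4]
Y = decode(encode(X))
print(Y)             # [4, 3, 2, 1]
print(Y == X[::-1])  # True: the output is the reversed input
```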
This autoencoder is not conditional, meaning it does not provide a condition data point
ŷτ = eT−τ+1 to the decoder ψ when decoding ŷτ+1 for any ŷτ in Y = [yτ | 1 ≤ τ ≤ T], even though
conditional predictors can provide better results. First, the condition acts as a hint that
tells a predictor which suffix should be decoded, but serves no additional purpose for an
autoencoder. Second, because adjacent events normally have significant short-term dependencies, it
is not optimal to offer a condition that causes the model to quickly pick up short-term dependencies
but ignore long-term connections. Code 4.6 shows a code snippet implementing the autoencoder.
elif mode == 'autoencoder':
    # autoencoder layer
    nn = keras.models.Sequential(name='RNN')
    nn_sizes = [int(nn_size / i) for i in range(1, self.config.hidden_layer + 1)]
    nn_sizes = [0] + nn_sizes + nn_sizes[::-1]  # dummy entry at index 0; layer indices below are 1-based
    for i in range(1, self.config.hidden_layer):
        nn.add(keras.layers.LSTM(nn_sizes[i], activation='relu', return_sequences=True, name='LSTM_' + str(i)))
    i = self.config.hidden_layer
    if self.config.use_repeat_vector:  # bi-direction
        nn.add(keras.layers.LSTM(nn_sizes[i], activation='relu', return_sequences=False, name='LSTM_' + str(i)))
        nn.add(keras.layers.RepeatVector(seqlen))
    else:
        nn.add(keras.layers.LSTM(nn_sizes[i], activation='relu', return_sequences=True, name='LSTM_' + str(i)))

    for i in range(self.config.hidden_layer + 1, 2 * self.config.hidden_layer + 1):
        nn.add(keras.layers.LSTM(nn_sizes[i], activation='relu', return_sequences=True, name='LSTM_' + str(i)))
    nn.add(keras.layers.TimeDistributed(keras.layers.Dense(n_labels, activation='softmax'), name='Softmax'))
    # model
    if n_floats == 0:
        model = keras.models.Model(inputs=label_input, outputs=nn(label_dense))
    else:
        model = keras.models.Model(inputs=[label_input, float_input], outputs=nn(merge))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
4.5 Hyperparameter Tuning
Specifically, the objective function of this event classifier γ is based on the categorical cross-entropy
loss, defined by:
L(P̂, P) = Σ_{τ=1}^{T} L(x_{T−τ+1}, P_τ) = − Σ_{τ=1}^{T} Σ_{i=1}^{V} p̂_i × log(p_i)    (4.5)
For the scoring criteria in anomaly detection, rank-based and threshold-based are two common
options adopted by previous work. In order to compare with existing work, DabLog adopts the rank-based
criterion, though it has a drawback which will be discussed in the last part. Say a discrete event et
is an instance of k̂i; a rank-based criterion will consider et anomalous if pi is not in the top-N predictions
(as shown in Figure 4.4).
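A minimal sketch of this rank-based criterion follows; the helper name and the toy distribution are illustrative, not DabLog's actual scoring code.

```python
# Rank-based anomaly criterion: flag an event anomalous if its true key index
# is not among the top-N highest-probability reconstructions.
def is_anomalous(probs, true_index, top_n):
    # indices of the top_n highest-probability keys
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return true_index not in ranked[:top_n]

probs = [0.05, 0.50, 0.30, 0.10, 0.05]  # toy reconstruction distribution
print(is_anomalous(probs, true_index=1, top_n=2))  # False: key 1 ranks first
print(is_anomalous(probs, true_index=0, top_n=2))  # True: key 0 outside top-2
```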
4.6 Model Deployment
4.7 Evaluation
We leverage accuracy and F1 Score to compare the performance of different models.
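As a reminder of how the F1 score combines precision and recall, here is a small sketch; the counts are toy numbers, not results from this chapter.

```python
# F1 is the harmonic mean of precision and recall, computed from raw counts.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# toy counts: 90 true positives, 10 false positives, 30 false negatives
print(round(f1_score(90, 10, 30), 4))  # 0.8182
```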
3https://www.tensorflow.org/lite
For the SOSP2009 dataset, from Figure 4.7a, Baseline reaches its peak F1 score of 87.32% at θN = 10%,
while DabLog reaches its peak F1 score of 97.18% at θN = 9%. Comparing the trends, it is easy to see
that DabLog has a higher peak and a wider range, hence DabLog is more advantageous than
Baseline and the other two, DeepLog and nLSALog. Similarly, in Figure 4.7b, DabLog has its peak F1 score
of 94.15% at θN = 6%, whereas Baseline only achieves an F1 score of 80.47% at θN = 6%.
For the UNSW-NB15 dataset, from Figure 4.6a and Figure 4.6b, both Baseline and DabLog have
a smooth curve and a wide plateau, which means the UNSW-NB15 dataset has fewer infrequent data points.
Both achieve similar F1 scores, but DabLog's is still slightly higher. Therefore,
based on the above comparisons, DabLog is more advantageous than Baseline.
(a) 366 keys upon UNSW-NB15 dataset. (b) 706 keys upon UNSW-NB15 dataset.
Figure 4.6: Comparison between baseline and DabLog trained on UNSW-NB15.
(a) 101 keys upon SOSP2009 dataset. (b) 304 keys upon SOSP2009 dataset.
Figure 4.7: Comparison between baseline and DabLog trained on SOSP2009.
higher precision means lower false positive rates. Figure 4.8b shows that while the recall rates of
DabLog and Baseline both decrease with θN, DabLog's decreases more slowly, since Baseline cannot identify
structurally abnormal sequences and thus has more false negatives. We explain this below.
(a) Precision upon different keys. (b) Recall rate upon different keys.
Figure 4.8: Comparison between baseline and DabLog trained on SOSP2009 dataset.
As noted, frameworks such as Baseline, DeepLog, and nLSALog cannot identify structurally
abnormal sequences; we borrow an example from DabLog to explain why.
We select a case that DabLog reports as abnormal, whereas Baseline reports it as normal. The session
has 34 events in total, listed in Table 4.8. DabLog reported the
subsequences s23 , s28 , s29 , and s30 abnormal, where s23 = [e14 , ... , e23 ] and s30 = [e21 , ... , e30 ].
These subsequences are considered abnormal because DabLog could not correctly reconstruct
the 21st event e21 in particular. That is, the key k3 is not within the top-9% reconstructions for e21. The
top-9% reconstructions include k4, variants of k5, a variant of k6 = "addStoredBlock: blockMap updated
...", and k7 = "EXCEPTION: block is not valid ...". These event keys, except k7, are frequent keys,
each accounting for over 0.1% of the dataset. Interestingly, here DabLog expects not only frequent keys, but
also an extremely rare event key k7 (which accounts for 0.0017%) before k5. Since these expected
keys in the top-9% reconstructions for e21 are related to exceptions, verification, or blockMap updates,
we believe that the reconstruction distribution is derived from a causal relationship with e23 (which
is related to block transmission) rather than with e19 (which is related to block deletion), even though
DabLog knows a deletion is requested at e19, as it has correctly reconstructed e19. Our interpretation is
that DabLog expects a cause at e21 that leads to the exception at e23, and it is the absence of causality
before e23 that makes the sequence structurally abnormal.
In contrast, Baseline does not predict e21 to be any of these keys after s20 = [e11 , . . . , e20 ]. In
other words, Baseline does not expect a cause at e21, because it cannot foresee e23 = k5. With the
fundamental limitation of being unable to exploit bi-directional causality, Baseline is incapable of detecting
such a structurally abnormal session. Therefore, we believe it is necessary for an anomaly detection
methodology to see sequences as atomic instances and examine the bi-directional causality as well
as the structure within a sequence. Single-direction anomaly detection like Baseline cannot identify
structurally abnormal sequences.
Table 4.8: Example of abnormal sequence events.
e14 Starting thread to transfer block
e15 Receiving block within the localhost
e16 blockMap updated: 10.251.∗ added of size 60 − 70MB
e17 Received block within the localhost
e18 Transmitted block within the subnet
e19 k1 ask 10.251.* to delete block(s)
e20 k2 blockMap updated: 10.251.* added of size 60-70 MB
e21 k3 Deleting block /mnt/hadoop/dfs/data/current/...
e22 k4 Verification succeeded for block
e23 k5 Got exception while serving block within the subnet
e24 Got exception while serving block within the subnet
e25 Verification succeeded for block
e26 delete block on 10.251.*
e27 delete block on 10.251.*
e28 delete block on 10.251.*
e29 ask 10.251.* to delete block(s)
e30 Deleting block /mnt/hadoop/dfs/data/current/...
4 https://github.com/PSUCyberSecurityLab/AIforCybersecurity/tree/main/Chapter4-BiLSTM-For-Anomaly_detection
5 https://github.com/PSUCyberSecurityLab/AIforCybersecurity/tree/main/Chapter4-BiLSTM-For-Anomaly_detection#misc
6 https://github.com/logpai/loghub
4.8 Code, Data, and Other Issues
Chapter 5 AI Detects DNS Cache Poisoning Attack
their packets to follow the protocol and make them look very similar to genuine DNS
packets. Accordingly, there are substantial incentives to apply machine learning to detect
this attack. However, the DNS cache poisoning attack is session-based: attack
packets have to be crafted based on user/server packets, and the attack may need one or
more attack packets. As a result, each data sample needs to contain information from
multiple packets, which makes feature engineering a daunting barrier.
In this use case, demonstrated in a prior work [97], deep learning is applied to help cross the
barrier.
The authors leverage protocol fuzzing to mutate the contents of network packets,
specifically the values of some fields in the packets. Using protocol fuzzing for data
generation has the following benefits: 1) A large variety of malicious network packets
for a chosen network attack can be generated quickly. 2) Since the network packets
are all generated from the chosen network attack, they can be labeled as malicious
automatically, without much human effort. 3) Protocol fuzzing can generate data with
more variation than real-world data, or even data not yet observed in the real world.
4) Protocol fuzzing can generate and cover malicious data samples that might otherwise
be overlooked when applying deep learning: the changed values in the fuzzed fields may
cause malicious data samples to be misclassified as benign, but since the fuzzed malicious
data are generated during attacks, they are labeled as malicious automatically and will
not be omitted from the malicious data sets. In addition, the above-mentioned merits
remain when protocol fuzzing is leveraged to generate the needed benign network packets.
It should be noted that the fuzzing is done in a way that the packet’s validity and
the session’s validity are not harmed. To preserve validity at the packet level, certain
fields that affect the packets’ validity are never fuzzed, such as fields of checksum values
and packet lengths. The values of those fields should not be arbitrarily changed. When
fuzzing other fields, their values should always be within their valid ranges. For example,
the most basic valid range for a field of one byte is [0, 1, 2, . . . , 255]. To preserve validity
at the session level, when choosing the fields to be fuzzed, only fields which can keep the
attack session complete and successful are chosen.
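The validity-preserving rules above can be sketched as follows. The field names, the never-fuzz set, and the valid ranges here are illustrative assumptions, not the authors' fuzzer.

```python
import random

# Sketch of validity-preserving field fuzzing: checksum/length fields are never
# touched, and fuzzed fields always stay within their valid ranges.
NEVER_FUZZ = {"udp.checksum", "ip.len"}                     # validity-critical fields
VALID_RANGE = {"dns.id": (0, 0xFFFF), "ip.ttl": (1, 255)}   # hypothetical per-field ranges

def fuzz_packet(fields, rng=random):
    fuzzed = dict(fields)
    for name, (lo, hi) in VALID_RANGE.items():
        if name in fuzzed and name not in NEVER_FUZZ:
            fuzzed[name] = rng.randint(lo, hi)  # stay within the valid range
    return fuzzed

pkt = {"dns.id": 0x1234, "ip.ttl": 64, "udp.checksum": 0xBEEF}
out = fuzz_packet(pkt)
assert out["udp.checksum"] == 0xBEEF                        # untouched
assert 0 <= out["dns.id"] <= 0xFFFF and 1 <= out["ip.ttl"] <= 255
```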
The testbed contains three machines: a local DNS server, a user machine, and an
attacker machine. The user machine is configured to contact the local DNS server to
resolve domain names.
In the malicious scenario, the user machine asks the local DNS server for the IP address of one
specific domain name using the command dig. The domain name is one
that does not have a corresponding record on the local DNS server, thus enabling the DNS
cache poisoning attack against it. The attacker machine sniffs for DNS queries with that
specific domain name sent out by the local DNS server, and responds to them with fuzzed
DNS responses carrying falsified IP addresses. The DNS cache then gets poisoned, and the
user machine receives the falsified DNS record. The user machine sends out DNS queries
periodically, so that the above process happens many times and a large amount of data can
be generated. However, as discussed earlier, if the local DNS server has the record for
the domain name in its cache, it will not send out DNS queries for it; this is why the
DNS cache of the local DNS server is flushed periodically, so that it remains vulnerable
across iterations. If the attack is successful, the falsified IP addresses can be seen in
the results of dig.
In the benign scenario, a list containing 4098 domain names is prepared. During each
iteration, the user machine randomly chooses one domain name from the list, and asks
the local DNS server for its IP address. To resemble the malicious scenario, the cache of
local DNS server is also flushed periodically so that the local DNS server always needs
to communicate with the global DNS server.
This use case uses network traffic data, which is network logs. The packet capturing
happens at the victim’s side, that is, the local DNS server.
Table 5.1 shows several examples of the captured DNS cache poisoning packets. All packets
shown are from one DNS cache poisoning session. 192.168.100.128 represents the user machine,
and 192.168.100.50 represents the local DNS server. Packet 1 shows the user machine asking the
local DNS server for the DNS record of www.example.net; packets 2 to 5 are the local DNS server's query
packets to the global DNS server; packet 6 is the attack packet sent to the local DNS server, whose IP
address is spoofed to be that of the global DNS server and whose DNS record is falsified; packet 7
is the local DNS server's response to the user machine after the (spoofed) response is received from the
global DNS server.
5.3 Labeling DNS Sessions
This attack is session-based. Each session, as shown in Table 5.1, is marked by the
DNS query packet sent by the user machine. The malicious impact can only be generated
when the attack session is complete. Therefore, the label is also assigned based on session:
if the session is a successful DNS cache poisoning attack, then the session is labeled as
malicious; otherwise, the session is labeled as benign.
Luckily, the malicious and benign raw data in this use case have already been separated.
The malicious and benign data are generated and collected separately. The benign raw
data does not have malicious sessions involved, and the malicious raw data does not have
unsuccessful attacks or benign sessions involved. Therefore, during data processing, the
benign and malicious raw data can be processed and labeled separately.
5.4 Feature Extraction and Data Sample Representation
def process_malicious(malicious):
    malicious_bytes = list()
    i = 0
    j = -1
    for item in tqdm(malicious, ascii=True, desc="Processing malicious data"):
        cap = pyshark.FileCapture(item, display_filter='dns and not tcp', use_json=True, include_raw=True)
        try:
            while (True):
                pkt = cap.next()
                # a user-machine query to the local DNS server marks a new session
                if pkt.ip.src_host == '192.168.100.128' and pkt.ip.dst_host == '192.168.100.50':  # chop by session
                    malicious_bytes.append([])
                    j += 1
                malicious_bytes[j].append(np.array(list(
                    int(ele) for ele in
                    pkt.get_raw_packet()[14:26] + pkt.get_raw_packet()[34:54])))  # select bytes of interest
                i += 1
        except StopIteration:
            pass
        cap.close()

    np.save(os.path.join('data', 'malicious_bytes.npy'), np.array(malicious_bytes, dtype=object), allow_pickle=True)
Therefore, instead of using the whole packet, 32 bytes are chosen from every DNS
packet, as listed in Table 5.2. All bytes in the lowest ETH layer are omitted. They are
excluded purposefully to rule out the impact of MAC addresses: the MAC address may
be treated as signatures to detect malicious packets. The chosen 32 bytes include bytes in
the IP layer, the UDP layer, and part of the DNS layer. Only part of DNS layer data is used
because the query and records are excluded. Those sections contain the domain names
and IP addresses, which are excluded for similar reasons as above: it is not desired that
the neural network learns the “malicious” domain. After the packet data processing, every
packet is represented as a fixed-length sequence of 32 integer numbers ranging from 0
to 255. The whole captured network traffic is represented as a sequence of 32-integer
vectors, and the sequence length is equal to the number of packets in the network log.
Now that the whole network log can be represented as a long sequence of vectors containing
32 integers, the next step is to chop the whole log (long sequence) by session.
Specifically, whenever a DNS query packet from the user machine is observed, the sequence
is chopped right before it. As a result, a number of variable-length sequences are
generated, and the sequence length here is equal to the number of packets in one session.
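The chopping rule can be sketched as follows; this is a toy illustration with single-character "packets", whereas the actual code operates on 32-integer vectors.

```python
# Split a packet log into sessions: a new session starts right before every
# DNS query packet sent by the user machine.
def chop_by_session(packets, is_user_query):
    sessions = []
    for pkt in packets:
        if is_user_query(pkt) or not sessions:
            sessions.append([])   # start a new session at each user query
        sessions[-1].append(pkt)
    return sessions

# toy log: 'Q' marks a user query, other letters are the remaining packets
log = ["Q", "a", "b", "Q", "c", "Q", "d", "e", "f"]
sessions = chop_by_session(log, lambda p: p == "Q")
print([len(s) for s in sessions])  # [3, 2, 4]
```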
The following step is to unify the data representation. Specifically, the variable-length
sequences should be processed into fixed-length sequences. Generally, there are two
ways to do this. One is to find the longest sequence and pad all shorter sequences to
the same length with dummy data. The other is to apply sliding windows, so that long
sequences are converted into several shorter ones. Both have proven effective in multiple
deep learning scenarios, and in this use case the sliding window
approach is used. After applying the sliding window, a number of fixed-length sequences
are generated, and the sequence length here is equal to the sliding window length. Each
element inside is a vector containing 32 integers.
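The sliding-window conversion can be sketched generically as follows; w_size and w_step correspond to the window parameters tuned later in this chapter.

```python
# Convert a variable-length session into fixed-length sequences of w_size
# elements, advancing the window by w_step each time.
def sliding_windows(session, w_size, w_step):
    return [session[i:i + w_size]
            for i in range(0, len(session) - w_size + 1, w_step)]

session = list(range(10))                  # a toy 10-packet session
windows = sliding_windows(session, w_size=6, w_step=1)
print(len(windows))   # 5 windows, starting at positions 0..4
print(windows[0])     # [0, 1, 2, 3, 4, 5]
```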
The last step is to change the integers. Those integers are converted from bytes, but an integer may
not be the most suitable representation of a byte. For example, the flags field in the DNS layer
has two bytes, but these bytes are made up of control bits; as a result, close numbers do not
necessarily mean that the represented functionalities are close. In a response
DNS packet, the lower byte (eight bits) of the flags field contains five control flags: bits
0 to 3 indicate the reply code; bit 4 indicates whether to accept non-authenticated data; bit 5
indicates whether the answer is authenticated; bit 6 is reserved in response DNS packets; and
bit 7 indicates whether recursion is available. For instance, 0x00, where all bits are zero, and
0x10, where all but bit 4 are zero, differ in only one bit, but the numeric difference is 16. In short,
a bit representation may be more natural in this use case. Correspondingly, the integers in the vectors
are converted into their bit representations.
5.5 Data Set Construction
def int2bin(num, padding=False, pad_to=8):
    lst = []
    while num > 1:
        lst.insert(0, num % 2)
        num = num // 2
    lst.insert(0, num)
    if padding:
        while len(lst) < pad_to:
            lst.insert(0, 0)
    return lst
5.7 Parameter Tuning
full_set = set()

# remove duplicates within the malicious set
tmp = []
for ele in X_mal_tmp:
    tmp.append(tuple(ele.reshape((window_size * 32,)).tolist()))
mal_set = set(tmp)
full_set = full_set | mal_set
X_mal_tmp = np.array(list(mal_set), dtype=np.uint8).reshape((len(mal_set), window_size, 32))

# remove duplicates within the benign set
tmp = []
for ele in X_ben_tmp:
    tmp.append(tuple(ele.reshape((window_size * 32,)).tolist()))
ben_set = set(tmp)
full_set = full_set | ben_set
X_ben_tmp = np.array(list(ben_set), dtype=np.uint8).reshape((len(ben_set), window_size, 32))

# print dataset statistics before removing double-dipping
print("Original size of dataset for k = %d" % k)
print("Malicious: %d" % len(X_mal_tmp))
print("Benign: %d" % len(X_ben_tmp))
print("Merged size: %d" % len(full_set))

if len(full_set) != (X_ben_tmp.shape[0] + X_mal_tmp.shape[0]):  # check whether double-dipping removal is needed
    # remove samples that appear in both classes ("double-dipping")
    X_mal = []
    X_ben = []
    print("Double dipping exists (%d != %d)! Removing double dipping..."
          % (len(full_set), X_ben_tmp.shape[0] + X_mal_tmp.shape[0]))
    for item in full_set:
        if (item in ben_set) and (item in mal_set):
            continue
        arr = np.array(item, dtype=np.uint8).reshape((window_size, 32))
        if item in ben_set:
            X_ben.append(arr)
        else:
            X_mal.append(arr)
        if (len(X_ben) % 1000 == 0) or (len(X_mal) % 1000 == 0):
            print("############################")
            print(str(k) + " :")
            print("Malicious: %d" % len(X_mal))
            print("Benign: %d" % len(X_ben))
            print("Progress: %d/%d" % (len(X_mal) + len(X_ben), len(full_set)))
    X_ben = np.array(X_ben)
    X_mal = np.array(X_mal)
else:
    X_ben = X_ben_tmp
    X_mal = X_mal_tmp

# remove unused variables to save memory
del X_ben_tmp
del X_mal_tmp
gc.collect()

# print dataset statistics after removing double-dipping
print("###########################")
print("Double dipping removal finished.")
print("Malicious: %d" % len(X_mal))
print("Benign: %d" % len(X_ben))
[Figure: Input → Convolution layer 1 (C1 units, kernel size K1) → Max pooling layer 1 (pool size 2×2) → Convolution layer 2 (C2 units, kernel size K2) → Max pooling layer 2 (pool size 2×2) → Dropout layer (rate D) → Flatten layer → Fully connected layer (F units) → Output layer]
Figure 5.3: Neural network structure for DNS cache poisoning detection.
5.8 Evaluation Results
The first set of parameters concerns the sliding window: the window length w_size and the
window step w_step. The window length defines the length of the sliding window, and the
window step defines the size of the sliding window's movement. For example, in Figure 5.2,
the window size is six and the window step is one. For w_size, {4, 6, 8, 10, 12} are tried;
for w_step, {1, 2, 4, 6, 8} are tried. Note that, depending on the first set of parameters,
the size of the data sets can differ.
The second set of parameters, the model hyperparameters, includes the number of units in some
hidden layers (convolutional layers 1 and 2 and the fully connected layer, N_hidden ∈
{(C1=16, C2=16, F=8), (C1=32, C2=16, F=8), (C1=32, C2=32, F=16), (C1=32, C2=32, F=8),
(C1=64, C2=32, F=16), (C1=64, C2=64, F=16)}), the kernel size in each CNN layer (K_size ∈
{(K1=2×2, K2=2×2), (K1=3×3, K2=3×3), (K1=4×4, K2=4×4)}), and the dropout rate in the
dropout layer (D ∈ {0.05, 0.1, 0.15, 0.2, 0.25}).
Code 5.5 shows how the classification neural network is built. The second set of parameters is
needed for model building. The window size from the first set is also needed, because it directly
determines the shape of the input data samples to the neural network.
Because it is not known beforehand which values will give the best results, all parameter
combinations are tried. With 4-fold cross validation, a total of 5 ∗ 5 ∗ 6 ∗ 3 ∗ 5 ∗ 4 = 9000
models are trained. The best model is chosen based on the average performance on the validation
set across all four folds.
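The arithmetic above can be checked by enumerating the grid; the parameter values are taken from this section, and the enumeration itself is just a sketch, not the tuning code.

```python
from itertools import product

# Count the hyperparameter grid: 5 window sizes x 5 steps x 6 unit settings x
# 3 kernel sizes x 5 dropout rates, each trained with 4-fold cross validation.
w_sizes  = [4, 6, 8, 10, 12]
w_steps  = [1, 2, 4, 6, 8]
n_hidden = [(16, 16, 8), (32, 16, 8), (32, 32, 16),
            (32, 32, 8), (64, 32, 16), (64, 64, 16)]   # (C1, C2, F)
k_sizes  = [(2, 2), (3, 3), (4, 4)]                    # (K1, K2)
dropouts = [0.05, 0.1, 0.15, 0.2, 0.25]

combos = list(product(w_sizes, w_steps, n_hidden, k_sizes, dropouts))
print(len(combos))       # 2250 parameter combinations
print(len(combos) * 4)   # 9000 models with 4-fold cross validation
```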
5.9 Model Deployment
5.10 Remaining Issues
Chapter 6 AI Detects PC Malware
6.3 Data Processing
6.4 Model Training
[Figure: control-flow-graph excerpt with basic blocks at 0x400520, 0x40054f, and 0x400576 (0x00400576: leave; 0x00400577: ret)]
block = proj.factory.block(proj.entry)
blocks = {}
node_features = []
G = nx.DiGraph()
idx = 0
for n in cfg.graph.nodes():
    blocks[hex(n.addr)] = idx
    G.add_node(idx)
    idx += 1
    uint8 = []
    if n.block is not None:
        block_instructions = n.block.capstone.__str__()
        vector = n.block.bytes.hex()
        b = bytearray.fromhex(vector)
        for i in range(len(b)):
            uint8.append(b[i])
    uint8.extend([256] * (feature_dim - len(uint8)))
    node_features.append(uint8)
for k, v in cfg.graph.edges():
    G.add_edge(blocks[hex(k.addr)], blocks[hex(v.addr)])
return G
class CNN_MAL(tf.keras.Model):
    def __init__(self, num_classes):
        """Initializes the CNN model
        :param num_classes: The number of classes in the dataset.
        """
        super(CNN_MAL, self).__init__(name="CNN_MAL")
        self.num_classes = num_classes
        self.image_shape = [-1, config.img_width, config.img_height, config.channel]

        def __graph__():
            self.preprocess_input = tf.keras.applications.mobilenet_v2.preprocess_input
            self.base_model = tf.keras.applications.ResNet50(input_shape=self.image_shape,
                                                             include_top=False,
                                                             weights='imagenet')
            self.base_model.trainable = False
            self.global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
            # Dropout, to avoid overfitting
            self.dropout_layer = tf.keras.layers.Dropout(0.2)
            # Readout layer
            self.prediction_layer = layers.Dense(num_classes)

        __graph__()

    def call(self, x_input, training=False):
        x_input = self.preprocess_input(x_input)
        x = self.base_model(x_input, training=False)
        x = self.global_average_layer(x)
        x = self.dropout_layer(x)
        logits = self.prediction_layer(x)
        return logits
We additionally apply layer normalization after the LSTM layer. Since the sequence of one
binary is divided into several subsequences, the sequence-level feature x_i is summarized by adding
up the features of all subsequences: x_i = Σ_j x_i^j. Finally, a dropout and a classification layer are employed.
def __init__(self, num_classes, rnn_type):
    """Initializes the RNN model
    :param num_classes: The number of classes in the dataset.
    :param rnn_type: lstm, rnn, or gru
    """
    super(RNN_MAL, self).__init__(name="RNN_MAL")
    self.num_classes = num_classes

    def __graph__():
        self.embedding_layer = layers.Embedding(config.max_features, config.hidden_units)
        if rnn_type == 'lstm':
            self.rnn = LSTM(config.hidden_units, return_state=False, return_sequences=True)
        elif rnn_type == 'rnn':
            self.rnn = RNN(config.hidden_units, return_state=False, return_sequences=True)
        Z = self.gcnn_layer(Z, self.weight_matrix[i])
        Z1_h.append(Z)
    Z1_h = tf.concat(Z1_h, 1)
    return Z1_h
The SortPooling layer extracts and sorts the vertex features based on their structural roles within
the graph [94]. Here, the extracted vertex features are the continuous Weisfeiler-Lehman (WL) colors.
As in the DGCNN model, after all vertices are sorted using the last layer's output, the top-k
sorted vertices are sent to a convolutional layer and a classification layer.
def sort_pooling_layer(self, gcnn_out):
    def sort_a_graph(index_span):
        indices = tf.range(index_span[0], index_span[1])
        graph_feature = tf.gather(gcnn_out, indices)

        graph_size = index_span[1] - index_span[0]
        k = tf.cond(self.k > graph_size, lambda: graph_size, lambda: self.k)
        top_k = tf.gather(graph_feature, tf.nn.top_k(graph_feature[:, -1], k=k).indices)

        zeros = tf.zeros([self.k - k, sum(self.gcnn_dims)], dtype=tf.float32)
        top_k = tf.concat([top_k, zeros], 0)
        return top_k

    sort_pooling = tf.map_fn(sort_a_graph, self.graph_indexes, dtype=tf.float32)
    return sort_pooling

def cnn1d_layers(self, inputs):
    total_dim = sum(self.gcnn_dims)
    graph_embeddings = tf.reshape(inputs, [-1, self.k * total_dim, 1])  # (batch, width, channel)
    first_conv = self.first_conv_layer(graph_embeddings)
    first_conv_pool = self.first_pooling_layer(first_conv)

    second_conv = self.second_conv_layer(first_conv_pool)
    return second_conv

def classification_layer(self, inputs):
    cnn1d_embed = self.flatten_layer(inputs)
    outputs = self.dense_layer(cnn1d_embed)
    return outputs
6.5 Model Deployment
6.6 Remaining Issues
very effective at detecting known malware but generally unproductive at detecting previously unknown
malware. Attackers can reorder the malware code or insert useless code to avoid this kind of detection.
Figure 6.5 shows a slice of code from a well-known malware family distributed by APT threat actor
OceanLotus on the left, and a YARA signature to detect it on the right [80].
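As a toy illustration of why such signatures are brittle, consider plain byte-subsequence matching; this is not YARA's rule language, and the signature bytes below are made up.

```python
# Toy byte-signature matcher: the signature is an exact byte subsequence,
# so reordering the same bytes defeats the match.
SIGNATURE = bytes.fromhex("deadbeef")   # made-up signature

def matches_signature(binary, signature=SIGNATURE):
    return signature in binary

sample = b"\x90\x90" + SIGNATURE + b"\x90"
reordered = SIGNATURE[2:] + b"\x90\x90\x90" + SIGNATURE[:2]  # same bytes, reordered
print(matches_signature(sample))     # True
print(matches_signature(reordered))  # False: reordering evades the signature
```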
Dynamic analysis executes programs in a virtual environment to monitor their behaviors
and observe their functionality. Several tools can be used to safely execute suspicious programs:
sandboxes (e.g., Cuckoo, DefenseWall, Bufferzone), virtual machines (e.g., HoneyMonkey, VGround),
and emulators (e.g., TTAnalyze, K-Tracer). In general, dynamic analysis provides an emulated
environment in which to run suspicious applications. Features obtained by dynamic analysis include
API calls, system calls, registry changes, memory writes, network patterns, etc. [18, 57, 72].
Although dynamic analysis is potentially comprehensive, it is more computationally expensive and
less widely used; it is also more time consuming and has higher false-positive rates. Some malware
performs an early check and immediately exits if it finds itself running in a virtual machine. Even
worse, some malware intentionally exhibits benign behaviors to lead human analysts to draw
incorrect conclusions about its intent.
In static-analysis-based malware detection, before a suspicious program file is executed, static
features are extracted from the executable file to determine whether the file is malicious.
Some works use the binary file itself as an indicator to detect malware [17, 62].
Characteristics of the binary files, such as PE import features, metadata, and strings, are also
widely applied in malware detection [19]. Others leverage reverse engineering to understand
a program's architecture: the program is disassembled to extract high-level representations,
including instruction flow graphs, control flow graphs, call graphs, and opcode sequences
[19, 52, 86]. One advantage of static analysis is that it is usually substantially faster than
dynamic analysis; another is that it can achieve better coverage.
Even though the performance of the models trained in this use case looks very good, adversarial
attacks should be considered when deploying these models in the real world. For example, code
obfuscation is effective at evading signature-based detection because it can significantly change the
syntactic structure of the original malware; similarly, obfuscated binaries might also evade DL models.
Code obfuscation tools serve two main purposes: (a) protecting intellectual property; (b) evading
malware detection systems. There are a variety of code obfuscation techniques, with one basic
requirement: the program semantics must be preserved. Traditionally, attackers use obfuscation
tricks, including dead-code insertion, register reassignment, subroutine reordering, instruction
substitution, and so on, to morph their malware and evade detection [15, 92]. Here we list the
definitions of some widely used obfuscation tricks:
Semantic Nops Insertion: Inserting certain ineffective instructions, e.g., NOP, into the original
6.7 Code and Data Resources
Chapter 7 AI Detects Code Similarity
Code 7.2: Code generated using the O0 compiler option. Code 7.3: Code generated using the O1 compiler option.
Code 7.2 and Code 7.3 show the code emitted by GCC 7.5.0 at two different optimization levels.
As shown in the listings, the emitted code is very different, and it is not trivial to determine whether
the two binary code snippets are compiled from the same source code. By generating code using
different optimization levels and compilers, training data with ground truth can be obtained with a
minimal amount of human effort.
To summarize, the raw data are obtained by compiling source code with different compiler settings.
The compiled program can be further divided into functions, basic blocks, and/or loops. During the
model training phase, functions can be identified using the symbol table; during the production
phase, functions are identified using existing tools, e.g., IDA Pro.
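A sketch of deriving such ground-truth labels automatically (the function names, compiler settings, and byte blobs below are hypothetical): every pair of binaries compiled from the same source function is a positive example, and every other pair is a negative one.

```python
from itertools import combinations

# Hypothetical compiled corpus: (function_name, compiler_setting) -> binary blob
corpus = {
    ("md5_update", "gcc-O0"):   b"...bytes0...",
    ("md5_update", "gcc-O1"):   b"...bytes1...",
    ("sha1_init",  "gcc-O0"):   b"...bytes2...",
    ("sha1_init",  "clang-O2"): b"...bytes3...",
}

def make_pairs(corpus):
    """Label a pair positive (1) iff both binaries come from the same source function."""
    pairs = []
    for k1, k2 in combinations(sorted(corpus), 2):
        label = 1 if k1[0] == k2[0] else 0
        pairs.append((corpus[k1], corpus[k2], label))
    return pairs

pairs = make_pairs(corpus)
print(sum(label for _, _, label in pairs), "positive /", len(pairs), "total")  # 2 positive / 6 total
```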
7.3 Data Processing
89 54 04 40             mov  %edx,0x40(%rsp,%rax,1)
48 83 c0 04             add  $0x4,%rax
48 83 f8 18             cmp  $0x18,%rax
0f 85 3b 07 00 00       jne  0x74d
the structures is better than the other, and one should select an appropriate representation based on
empirical study.
The raw bytes can naturally be mapped to decimal integer values, i.e., 0 to 255, to enable the
computation of the deep learning model. If padding is needed, the padding value can be mapped to
an integer outside of that range. The downside of this mapping is that the numerical characteristics
of the bytes are carried into the deep learning model, which is undesirable, because one cannot
claim that byte 0x10 is semantically "smaller" than 0xff. One solution is to map the bytes into
one-hot vectors, but this greatly increases the size of the model.
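The two encodings can be sketched as follows; PAD = 256 is an assumed out-of-range padding value, and the one-hot vocabulary of 257 covers the 256 byte values plus padding:

```python
PAD = 256  # assumed padding id just outside the 0-255 byte range

def bytes_to_ints(code: bytes, length: int):
    """Map raw bytes to integers 0..255 and pad to a fixed length with PAD."""
    ids = list(code[:length])
    return ids + [PAD] * (length - len(ids))

def one_hot(ids, vocab=257):
    """One-hot encoding removes the spurious numeric ordering among byte values."""
    return [[1 if i == v else 0 for i in range(vocab)] for v in ids]

print(bytes_to_ints(b"\x10\xff", 4))  # [16, 255, 256, 256]
vecs = one_hot(bytes_to_ints(b"\x10\xff", 4))
```

In the one-hot form, 0x10 and 0xff are equidistant unit vectors, at the cost of a 257-fold larger input.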
Figure 7.4: The illustration of generating an ACFG. (a) The original CFG; (b) the ACFG generated from the CFG shown in Figure 7.4a; (c) the corresponding source code.
Figure 7.4 shows an example of the process of generating an ACFG from a standard CFG. As shown
in Figure 7.4a, each node in the graph is a basic block, which always ends with a branch-transfer
instruction. The 6 basic-block features mentioned above can be extracted easily through simple
program analysis, and the 2 graph-structural features can be calculated through graph analysis.
Together, the two kinds of features form an 8-element node feature vector for each node.
The generated ACFG is ready to be fed into the deep learning model.
The ACFG is architecture independent. In other words, the challenge of cross-architecture code
similarity detection is alleviated by abstracting the information into the ACFG. However, the downside
is that, since the node features are selected based on heuristics and domain knowledge, some implicit
patterns could be neglected.
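A minimal sketch of assembling the 8-element node vectors: the block-level counts below are made up, and out-degree is used here as a simple stand-in for a structural metric such as betweenness centrality; the actual feature set follows the description above.

```python
# Toy CFG: node -> (successor list, basic-block feature counts from a disassembler)
cfg = {
    "b0": (["b1", "b2"], dict(consts=2, strings=0, transfer=1, calls=0, instrs=5, arith=2)),
    "b1": (["b2"],       dict(consts=0, strings=1, transfer=1, calls=1, instrs=3, arith=0)),
    "b2": ([],           dict(consts=1, strings=0, transfer=0, calls=0, instrs=2, arith=1)),
}

def offspring(cfg, node, seen=None):
    """Structural feature: number of nodes reachable from this node."""
    seen = set() if seen is None else seen
    for s in cfg[node][0]:
        if s not in seen:
            seen.add(s)
            offspring(cfg, s, seen)
    return len(seen)

def acfg(cfg):
    feats = {}
    for node, (succs, c) in cfg.items():
        feats[node] = [c["consts"], c["strings"], c["transfer"], c["calls"],
                       c["instrs"], c["arith"],
                       len(succs),             # out-degree (structural stand-in)
                       offspring(cfg, node)]   # second structural feature
    return feats

print(acfg(cfg)["b0"])  # [2, 0, 1, 0, 5, 2, 2, 2]
```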
7.4 Model
Figure 7.5: The Siamese model architecture using GNN as the backbone. (An input pair, Sample 1 and Sample 2, is fed into two GNN backbones, and their output embeddings are compared via a distance / contrastive loss.)
To obtain a quantified similarity score, this use case applies deep learning to learn embedding
vectors for the given inputs. Just as in other deep-learning-based similarity detection applications,
such as face recognition, where images of the same person's face result in similar embedding vectors,
binary code compiled from the same function should also result in similar vectors. To achieve this
goal, one of the most popular model architectures is the Siamese network [10], which was initially
proposed to verify hand-written signatures. To train a Siamese network, labeled data is needed.
The backbone of the deep learning model mostly depends on the data representation: for sequence
data, one can use a convolutional neural network (CNN), whereas for graph data, one can use a GNN.
The focus of this use case is the graph model, as the CNN backbone is rather straightforward. The
diagram of the model architecture is shown in Figure 7.5; in the rest of this section, the GNN backbone
and the Siamese network will be introduced.
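The training objective of such a Siamese network can be sketched with a contrastive loss. The cosine distance and the margin value below are illustrative choices, not necessarily the ones used in any particular paper:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) *
                        math.sqrt(sum(b * b for b in v)))

def contrastive_loss(emb1, emb2, label, margin=1.0):
    """label = 1 for a similar pair, 0 for a dissimilar pair."""
    d = cosine_distance(emb1, emb2)
    if label == 1:
        return d ** 2                    # pull similar pairs together
    return max(0.0, margin - d) ** 2     # push dissimilar pairs beyond the margin

print(contrastive_loss([1.0, 0.0], [1.0, 0.0], 1))  # identical similar pair -> 0.0
print(contrastive_loss([1.0, 0.0], [1.0, 0.0], 0))  # identical dissimilar pair -> 1.0
```

Minimizing this loss over labeled pairs drives embeddings of the same function together and embeddings of different functions apart, which is exactly the property the similarity score relies on.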
1 def graph_embed(X, msg_mask, N_x, N_embed, N_o, iter_level, Wnode, Wembed, W_output, b_output):
2 node_val = tf.reshape(tf.matmul( tf.reshape(X, [-1, N_x]) , Wnode),
3 [tf.shape(X)[0], -1, N_embed])
4
5 cur_msg = tf.nn.relu(node_val)
6 for t in range(iter_level):
7 # Message convey
8 Li_t = tf.matmul(msg_mask, cur_msg)
9 # Complex Function
10 cur_info = tf.reshape(Li_t, [-1, N_embed])
11 for Wi in Wembed:
12 if (Wi == Wembed[-1]):
13 cur_info = tf.matmul(cur_info, Wi)
14 else:
15 cur_info = tf.nn.relu(tf.matmul(cur_info, Wi))
16 neigh_val_t = tf.reshape(cur_info, tf.shape(Li_t))
17 # Adding
18 tot_val_t = node_val + neigh_val_t
19 # Nonlinearity
20 tot_msg_t = tf.nn.tanh(tot_val_t)
21 cur_msg = tot_msg_t
22
23 g_embed = tf.reduce_sum(cur_msg, 1)
24 output = tf.matmul(g_embed, W_output) + b_output
25
26 return output
Python code1 released by the authors of Gemini. As shown in Figure 7.6, in this implementation,
the variable X is the tensor containing the node features for all nodes, while all edge information is
stored in msg_mask, which is a binary matrix (i.e., its elements are either 0 or 1).
Lines 2 to 5 correspond to the W·x_v term in Equation 7.1, producing the initial embedding of each
node in the first iteration. Then, starting from line 6, the loop over the iterations (i.e., hops) begins.
Notice that message aggregation is done by multiplying a message mask tensor (msg_mask) with the
node feature/embedding tensor (cur_msg); this operation is shown in line 8. The rest of the loop
body, from line 10 to line 21, is straightforward, implementing the other parts of Equation 7.1. At
the end, in line 23, the node embeddings are summed together to produce the embedding for the
entire graph, and in line 24, one more linear layer is applied to propagate the information.
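The effect of the msg_mask multiplication can be reproduced in plain Python: multiplying the mask by the embedding matrix sums, for each node, the embeddings of the nodes it receives messages from. The tiny graph below is invented for illustration.

```python
def matmul(A, B):
    """Plain-Python matrix multiply, standing in for tf.matmul on one graph."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Graph: node 0 receives messages from nodes 1 and 2; node 1 from node 2 only.
msg_mask = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
cur_msg = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # one 2-d embedding per node

print(matmul(msg_mask, cur_msg))  # [[2.0, 3.0], [2.0, 2.0], [0.0, 0.0]]
```

Row 0 of the result is exactly the sum of the embeddings of nodes 1 and 2, i.e., the aggregated message for node 0.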
As described earlier, the last layer of the GNN backbone (e.g., a pooling layer) outputs the
embedding of the graph by aggregating the node representations. The next section will introduce the
whole network structure used to learn from a pair of similar/dissimilar binary code snippets.
1https://github.com/xiaojunxu/dnn-binary-code-similarity/blob/master/graphnnSiamese.py
7.5 Code, Data and Other Issues
Code 7.5: A code snippet from Gemini to illustrate forward inference of the Siamese network.
by the two papers. The reason is that α-Diff performs code similarity detection at the program level,
whereas Gemini works at the function level. Therefore, readers should select a model based on their
own experiments and needs. In short, the graph model relies more on program-analysis domain
knowledge, but it is easier to extend and further customize than the sequence model. The sequence
model is usually easier to implement and faster to train, but since it takes encoded raw bytes as input,
less customization is available.
Chapter 8 AI Conducts Malware Clustering
behaviors of different malware samples clearly. Well-chosen features lay a solid foundation for precise
malware clustering. In the last step, a clustering algorithm is used to sort malware samples into
clusters based on the extracted features.
The malware clustering technique described in this chapter utilizes dynamic analysis to gather
information about malware samples; Subsection 8.4.1 gives more details on how the analysis is done.
Previous research has shown that static analysis has drawbacks when dealing with runtime-packed
code and complex obfuscations, which are often used in sophisticated malware. Moreover, the
possibility of writing semantically equivalent programs whose code differs greatly makes dynamic
analysis a necessity. Taint tracking is integrated into the dynamic analysis process to obtain important
features that better capture the behavior of malware samples. The output of the dynamic analysis phase
is an execution trace with taint information. The trace is then summarized into a behavior profile,
an abstraction of the execution trace that contains information about system-call-related resources,
such as files or registry keys. Next, the behavior profile is transformed into a feature set; the output
of this step is a set of features in a form suitable for the clustering algorithm. Subsection 8.4.2
gives more details on how the transformation is done. In the last step, scalable clustering of the
malware samples is performed using an algorithm based on locality-sensitive hashing (LSH).
8.4 Feature Extraction
where O is the set of all OS objects, OP is the set of all OS operations, Γ ⊆ (O × OP) is a relation
assigning one or several operations to each object, and ∆ ⊆ ((O × OP) × (O × OP)) represents the
set of dependences. C_V is the set of all compare operations of type label-value, while C_L is the set of
all compare operations of type label-label. Θ_CmpValue ⊆ (C_V × O) is a relation assigning label-value
compare operations to an OS object. Θ_CmpLabel ⊆ (C_L × O × O) is a relation assigning label-label
compare operations to the two appropriate OS objects."
In general, system calls are treated as taint sources in the taint tracking system. To be more
precise, out-arguments and return values of all system calls are tainted. Three types of information
are included inside a behavior profile:
System Call Dependences. Every in-argument of every system call is checked; if it is tainted, a
dependence between the taint-origin system call and the current system call is created.
Control Flow Dependences. Compare instructions that involve tainted data (results of system
calls) are recorded. Both label-value comparisons (between an untainted and a tainted value)
and label-label comparisons (between two tainted values) are summarized inside behavior
profiles.
Network Analysis. The relevant network behaviors are analyzed.
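The first kind of dependence can be sketched as follows; the system-call names and the value-id scheme are made up for illustration. Outputs of a call become taint sources, and a tainted in-argument of a later call yields a dependence edge.

```python
taint_of = {}        # value id -> syscall that produced (tainted) it
dependences = set()  # (origin syscall, dependent syscall)

def syscall(name, in_args, out_vals):
    """Record dependences for tainted in-arguments; taint all outputs."""
    for v in in_args:
        if v in taint_of:
            dependences.add((taint_of[v], name))
    for v in out_vals:
        taint_of[v] = name  # this syscall is the taint source for v

h = "handle_1"
syscall("NtCreateFile", in_args=[], out_vals=[h])
syscall("NtWriteFile", in_args=[h], out_vals=[])

print(dependences)  # {('NtCreateFile', 'NtWriteFile')}
```

The write depends on the create because its in-argument carries the taint label introduced by the create's output handle.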
Readers interested in carrying out the dynamic analysis on malware samples can refer to PANDA 1,
a Platform for Architecture-Neutral Dynamic Analysis, for extraction of system call traces and taint
information. Here, we list some available plugins that are relevant to our use case 2.
For each object o_i ∈ O, and for each assigned operation op_j ∈ OP with (o_i, op_j) ∈ Γ, create a feature:
f_ij = “op|” + name(o_i) + “|” + name(op_j)
For each dependence δ_i = ((o_i1, op_i1), (o_i2, op_i2)) ∈ ∆, we create a feature:
f_i = “dep|” + name(o_i1) + “|” + name(op_i1) + “ → ” + name(o_i2) + “|” + name(op_i2)
For each label-value comparison θ_i = (cmp, o) ∈ Θ_CmpValue, we create a feature:
f_i = “cmp_value|” + name(o) + “|” + name(cmp)
For each label-label comparison θ_i = (cmp, o_1, o_2) ∈ Θ_CmpLabel, we create a feature:
f_i = “cmp_label|” + name(o_1) + “ → ” + name(o_2) + “|” + name(cmp)
where name() is a function that returns the name of an OS object, operation, or comparison as a string;
quotes denote a literal string; and ‘+’ concatenates two strings."
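The first two feature constructions can be sketched directly in Python; the object and operation names below are invented, and "->" stands in for the arrow:

```python
def profile_to_features(ops, deps):
    """Turn (object, operation) pairs and dependences into feature strings."""
    feats = set()
    for obj, op in ops:                    # the Γ relation
        feats.add("op|{}|{}".format(obj, op))
    for (o1, op1), (o2, op2) in deps:      # the ∆ dependences
        feats.add("dep|{}|{} -> {}|{}".format(o1, op1, o2, op2))
    return feats

ops = [("file:a.exe", "create"), ("file:a.exe", "write")]
deps = [(("file:a.exe", "read"), ("net:1.2.3.4", "send"))]
print(sorted(profile_to_features(ops, deps)))
```

The resulting set of strings is the boolean feature vector handed to the clustering step: a sample either has a feature or it does not.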
1https://github.com/panda-re/panda
2System call tracing: https://github.com/panda-re/panda/tree/dev/panda/plugins/syscalls2,
General tainting: https://github.com/panda-re/panda/tree/dev/panda/plugins/taint2,
Control flow tainting: https://github.com/panda-re/panda/tree/dev/panda/plugins/tainted_branch
8.5 Scalable Clustering
     m1  m2
f1    1   0
f2    0   1
f3    0   0
f4    1   1
f5    0   1
Table 8.1: Boolean matrix representing two malware samples.
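The minhash step behind such a boolean matrix can be sketched as follows. With enough random row permutations, the fraction of agreeing signature positions estimates the Jaccard similarity of the two columns (exactly 1/4 for m1 and m2 of Table 8.1); the permutation-based scheme here is a simplified stand-in for the hash-function trick used in practice.

```python
import random

def minhash_signature(column, n_hashes, n_rows, seed=0):
    """For each random row permutation, record the smallest permuted index
    of a row where the boolean column has a 1."""
    rng = random.Random(seed)  # same seed -> same permutations for every column
    sig = []
    for _ in range(n_hashes):
        perm = list(range(n_rows))
        rng.shuffle(perm)
        sig.append(min(perm[i] for i, bit in enumerate(column) if bit))
    return sig

# Columns m1 and m2 from Table 8.1 (features f1..f5)
m1 = [1, 0, 0, 1, 0]
m2 = [0, 1, 0, 1, 1]

s1 = minhash_signature(m1, 100, 5)
s2 = minhash_signature(m2, 100, 5)
est = sum(a == b for a, b in zip(s1, s2)) / 100
print("estimated Jaccard ~", est, "(exact: 0.25)")
```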
[81]. The result of LSH is a set of pairs of similar malware samples. This set of pairs is then sorted
by similarity, which allows us to produce a single-linkage hierarchical clustering. Figure 8.3 shows
the process of finding the set of similar pairs of malware samples.
columns hash into the same bucket for at least one of the hash functions, we treat the two columns,
which represent two malware samples, as a candidate pair.
Banding technique. Given a minhash signature matrix, an effective way to choose the hashings is
to divide the signature matrix into b bands consisting of r rows each. For each band, there is a hash
function that takes vectors of r integers (the portion of one column within that band) and hashes them
to some large number of buckets. The same hash function can be used for all the bands, but a separate
bucket array is used for each band, so columns with the same vector in different bands will not hash
to the same bucket [43]. Figure 8.5 shows an example of a hash function for one band.
Parameter choosing. As stated in [43], “Suppose we use b bands of r rows each, and suppose that
a particular pair of documents have Jaccard similarity s. The probability the minhash signatures for
these documents agree in any one particular row of the signature matrix is s. We can calculate the
probability that these documents (or rather their signatures) become a candidate pair as follows:
1. The probability that the signatures agree in all rows of one particular band is s^r.
2. The probability that the signatures disagree in at least one row of a particular band is 1 − s^r.
3. The probability that the signatures disagree in at least one row of each of the bands is (1 − s^r)^b.
4. The probability that the signatures agree in all the rows of at least one band, and therefore
become a candidate pair, is 1 − (1 − s^r)^b.
It may not be obvious, but regardless of the chosen constants b and r, this function has the form
of an S-curve, as suggested in Figure 8.6. The threshold is roughly where the rise is the steepest,
and for large b and r we find that pairs with similarity above the threshold are very likely to become
candidates, while those below the threshold are unlikely to become candidates – exactly the situation
we want. An approximation to the threshold is (1/b)^(1/r)." Therefore, since t = (1/b)^(1/r), by
defining only one parameter, a suitable t, with b·r = n, where n stands for the total number of features,
we will be able to find appropriate r and b.
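The S-curve and the threshold approximation can be checked numerically; b = 20 and r = 5 are arbitrary example values:

```python
def candidate_probability(s, b, r):
    """Probability that a pair with Jaccard similarity s becomes a candidate
    under b bands of r rows each: 1 - (1 - s**r)**b."""
    return 1 - (1 - s ** r) ** b

def threshold(b, r):
    """Approximate similarity threshold of the S-curve."""
    return (1.0 / b) ** (1.0 / r)

b, r = 20, 5  # signature length n = b * r = 100
print("threshold ~", round(threshold(b, r), 3))
print("P(candidate | s=0.8):", candidate_probability(0.8, b, r))
print("P(candidate | s=0.3):", candidate_probability(0.3, b, r))
```

With these values the threshold is about 0.55: a pair at similarity 0.8 is almost certain to become a candidate, while a pair at 0.3 is very unlikely to.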
Code implementation. Some code that can be used to implement LSH is shown below in Code 8.1
and Code 8.2.
class LSH_Util(object):
    def __init__(self, length, threshold):
        self.length = length
        self.threshold = threshold
        self.bandwidth = self.get_bandwidth(length, threshold)

    def hash(self, sig):
        # Split the signature into bands of `bandwidth` rows and hash each band
        for band in zip(*(iter(sig),) * self.bandwidth):
            yield hash(str(band))

    def get_bandwidth(self, n, t):
        # Pick the number of rows r per band such that b * r is closest to n,
        # given the desired threshold t ~ (1/b)^(1/r)
        best = n
        minerr = float("inf")
        for r in range(1, n + 1):
            try:
                b = 1. / (t ** r)
            except ZeroDivisionError:
                return best
            err = abs(n - b * r)
            if err < minerr:
                best = r
                minerr = err
        return best

    def get_threshold(self):
        r = self.bandwidth
        b = self.length / r
        return (1. / b) ** (1. / r)

    def get_n_bands(self):
        return int(self.length / self.bandwidth)

Code 8.1: Code for the LSH utility.3

from collections import defaultdict  # MinHashSignature comes from the same source

class LSH(object):
    def __init__(self, width=10, threshold=0.5):
        self.width = width
        self.signer = MinHashSignature(width)
        self.hasher = LSH_Util(width, threshold)
        self.hashmaps = [defaultdict(list)
                         for _ in range(self.hasher.get_n_bands())]

    def add_sample(self, s, label=None):
        # A label for this sample
        if not label:
            label = s

        # Get the minhash signature
        sig = self.signer.sign(s)

        # Construct hashmaps storing the LSH hashes
        for band_idx, hshval in enumerate(self.hasher.hash(sig)):
            self.hashmaps[band_idx][hshval].append(label)

    # Iterate through the hashmaps to get candidate pairs
    ...

Code 8.2: Code for implementing LSH.4
Since the result of LSH only provides information about pairs whose similarity is above the
threshold, the subsequent merging of clusters whose similarity is below the threshold cannot be
derived from it directly. To obtain an exhaustive hierarchical clustering, exact hierarchical clustering
still needs to be performed on the representatives of the readily available clusters together with all
the remaining clusters. Although exact hierarchical clustering has a time complexity of O(n²), the
cost is acceptable since the number of representatives and remaining clusters is usually not large.
8.7 Concluding Remarks
Bibliography
[1] Yousra Aafer, Wenliang Du, and Heng Yin. “Droidapiminer: Mining Api-Level Features for
Robust Malware Detection in Android”. In: International Conference on Security and Privacy
in Communication Systems. Springer. 2013, pp. 86–103.
[2] Martin Abadi et al. “Control-flow Integrity Principles, Implementations, and Applications”. In:
ACM Transactions on Information and System Security (TISSEC) 13.1 (2009), pp. 1–40.
[3] Y. Xin et al. “Machine Learning and Deep Learning Methods for Cybersecurity”. In: IEEE
Access (2018).
[4] Android Emulator. Android.com, 2020. url: https://developer.android.com/studio/r
un/emulator.
[5] Steven Arzt et al. “Flowdroid: Precise Context, Flow, Field, Object-sensitive and Lifecycle-
aware Taint Analysis for Android Apps”. In: Acm Sigplan Notices 49.6 (2014), pp. 259–269.
[6] Erick Bauman, Zhiqiang Lin, Kevin W Hamlen, et al. “Superset Disassembly: Statically Rewrit-
ing x86 Binaries Without Heuristics.” In: NDSS. 2018.
[7] Ulrich Bayer et al. “Scalable, behavior-based malware clustering.” In: NDSS. 2009.
[8] Christopher M Bishop. “Pattern Recognition”. In: Machine Learning 128.9 (2006).
[9] Ivan Blekanov and Vasilii Korelin. “Hierarchical clustering of large text datasets using Locality-
Sensitive Hashing”. In: IWAIT Workshop. 2015.
[10] Jane Bromley et al. “Signature Verification Using a “Siamese” Time Delay Neural Network”. In:
International Journal of Pattern Recognition and Artificial Intelligence 7.04 (1993), pp. 669–
688.
[11] Haipeng Cai et al. “Droidcat: Effective Android Malware Detection and Categorization via
App-level Profiling”. In: IEEE Transactions on Information Forensics and Security 14.6 (2018),
pp. 1455–1470.
[12] Ligeng Chen, Zhongling He, and Bing Mao. “CATI: Context-Assisted Type Inference from
Stripped Binaries”. In: 2020 50th Annual IEEE/IFIP International Conference on Dependable
Systems and Networks (DSN). IEEE. 2020, pp. 88–98.
[13] Qi Alfred Chen, Zhiyun Qian, and Z Morley Mao. “Peeking into Your App without Actually
Seeing It: UI State Inference and Novel Android Attacks”. In: 23rd USENIX Security Symposium
(USENIX Security 14). 2014, pp. 1037–1052.
[14] Yoon-Ho Choi et al. “Using Deep Learning to Solve Computer Security Challenges: A Survey”.
In: Cybersecurity (2020).
[15] Mihai Christodorescu and Somesh Jha. “Testing Malware Detectors”. In: ACM SIGSOFT
Software Engineering Notes 29.4 (2004), pp. 34–44.
[16] Zheng Leong Chua et al. “Neural Nets Can Learn Function Type Signatures from Binaries”.
In: 26th USENIX Security Symposium (USENIX Security 17). 2017, pp. 99–116.
[17] Zhihua Cui et al. “Detection of Malicious Code Variants Based on Deep Learning”. In: IEEE
Transactions on Industrial Informatics 14.7 (2018), pp. 3187–3196.
[18] Dahl et al. “Large-scale Malware Classification Using Random Projections and Neural Net-
works”. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
- Proceedings (2013), pp. 3422–3426.
[19] Leonardo De La Rosa et al. “Efficient Characterization and Classification of Malware Using
Deep Learning”. In: Proceedings - Resilience Week 2018, RWS 2018 (2018), pp. 77–83.
[20] Erik Derr. axplorer. 2017. url: https://github.com/reddr/axplorer.
[21] Steven HH Ding, Benjamin CM Fung, and Philippe Charland. “Asm2vec: Boosting Static
Representation Robustness for Binary Clone Search Against Code Obfuscation and Compiler
Optimization”. In: 2019 IEEE Symposium on Security and Privacy (SP). IEEE. 2019, pp. 472–
489.
[22] dlgroupuoft. PScout. 2018. url: https://github.com/dlgroupuoft/PScout.
[23] Min Du et al. “Deeplog: Anomaly detection and diagnosis from system logs through deep learn-
ing”. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications
Security. 2017, pp. 1285–1298.
[24] Chris Eagle. The IDA pro book. No Starch Press, 2011.
[25] Daniel R Ellis et al. “A Behavioral Approach to Worm Detection”. In: Proceedings of the 2004
ACM workshop on Rapid malcode. 2004, pp. 43–53.
[26] William Enck et al. “Taintdroid: An Information-flow Tracking System for Realtime Privacy
Monitoring on Smartphones”. In: ACM Transactions on Computer Systems (TOCS) 32.2 (2014),
pp. 1–29.
[27] Feature (Machine Learning). accessed: 2022-1-07. Wikipedia, Wikipedia Foundation. url:
https://en.wikipedia.org/wiki/Feature_(machine_learning).
[28] Feature Vector. Accessed: 2022-01-10. url: https://brilliant.org/wiki/feature-vec
tor/.
[29] Adrienne Porter Felt et al. “Android Permissions Demystified”. In: Proceedings of the 18th
ACM conference on Computer and communications security. 2011, pp. 627–638.
[30] Adrienne Porter Felt et al. “Permission Re-Delegation: Attacks and Defenses.” In: USENIX
Security Symposium. Vol. 30. 2011, p. 88.
[31] Qian Feng et al. “Scalable Graph-based Bug Search for Firmware Images”. In: Proceedings of
the 2016 ACM SIGSAC Conference on Computer and Communications Security. 2016, pp. 480–
491.
[32] GCC, the GNU Compiler Collection. Accessed: 2021-10-01. url: https://gcc.gnu.org.
[33] Liangyi Gong et al. “Experiences of Landing Machine Learning onto Market-scale Mobile
Malware Detection”. In: Proceedings of the Fifteenth European Conference on Computer
Systems. 2020, pp. 1–14.
[34] Liangyi Gong et al. “Systematically Landing Machine Learning onto Market-Scale Mobile
Malware Detection”. In: IEEE Transactions on Parallel and Distributed Systems 32.7 (2020),
pp. 1615–1628.
[35] Michael Grace et al. “Riskranker: Scalable and Accurate Zero-day Android Malware Detection”.
In: Proceedings of the 10th international conference on Mobile systems, applications, and
services. 2012, pp. 281–294.
[36] Wenbo Guo et al. “DEEPVSA: Facilitating Value-set Analysis with Deep Learning for Post-
mortem Program Analysis”. In: 28th USENIX Security Symposium (USENIX Security 19). 2019,
pp. 1787–1804.
[37] Michiel Hermans and Benjamin Schrauwen. “Training and analysing deep recurrent neural
networks”. In: Advances in neural information processing systems 26 (2013), pp. 190–198.
[38] Hyungjoon Koo et al. “Semantic-aware Binary Code Representation with BERT”. In: arXiv
preprint arXiv:2106.05478 (2021).
[39] Chris Lattner and Vikram Adve. “LLVM: A Compilation Framework for Lifelong Program
Analysis & Transformation”. In: International Symposium on Code Generation and Optimiza-
tion, 2004. CGO 2004. IEEE. 2004, pp. 75–86.
[40] Quoc Le and Tomas Mikolov. “Distributed Representations of Sentences and Documents”. In:
International Conference on Machine Learning. PMLR. 2014, pp. 1188–1196.
[41] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep Learning”. In: Nature 521.7553
(2015), pp. 436–444.
[42] Young Jun Lee et al. “Learning Binary Code with Deep Learning to Detect Software Weakness”.
In: KSII the 9th International Conference on Internet (ICONI) 2017 Symposium. 2017.
[43] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. 2nd.
USA: Cambridge University Press, 2014. isbn: 1107077230.
[44] Xuezixiang Li, Qu Yu, and Heng Yin. “PalmTree: Learning an Assembly Language Model for
Instruction Embedding”. In: arXiv preprint arXiv:2103.03809 (2021).
[45] Yujia Li et al. “Graph Matching Networks for Learning the Similarity of Graph Structured
Objects”. In: International Conference on Machine Learning. PMLR. 2019, pp. 3835–3845.
[46] Bingchang Liu et al. “αdiff: Cross-version Binary Code Similarity Detection with Dnn”. In:
Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engi-
neering. 2018, pp. 667–678.
[47] Liu Liu et al. “Insider Threat Identification using The Simultaneous Neural Learning of Multi-
source Logs”. In: IEEE Access 7 (2019), pp. 183162–183176.
[48] Liu Liu et al. “Unsupervised Insider Detection through Neural Feature Learning and Model
Optimisation”. In: International Conference on Network and System Security. Springer. 2019,
pp. 18–36.
[49] Aravind Machiry et al. “Using Loops for Malware Classification Resilient to Feature-unaware
Perturbations”. In: ACM International Conference Proceeding Series. Association for Comput-
ing Machinery, Dec. 2018, pp. 112–123.
[50] Mark McDermott. “Presentation: The ARM Instruction Set Architecture”. In: (2008). url:
http://users.ece.utexas.edu/~valvano/EE345M/Arm_EE382N_4.pdf.
[51] John H. McDonald. Spearman Rank Correlation. 2019. url: http://www.biostathandboo
k.com/spearman.html.
[52] Niall McLaughlin et al. “Deep Android Malware Detection”. In: Proceedings of the 7th ACM
Conference on Data and Application Security and Privacy (2017), pp. 301–308.
[53] Mining of massive datasets. url: http://www.mmds.org/.
[54] Robert Monarch. Human-in-the-Loop Machine Learning. Manning Publications Corp., 2021.
[55] Akira Mori et al. “A Tool for Analyzing and Detecting Malicious Mobile Code”. In: ICSE.
Vol. 2006. May 2006, pp. 831–834.
[56] Nour Moustafa and Jill Slay. “UNSW-NB15: A Comprehensive Data Set for Network Intrusion
Detection Systems (UNSW-NB15 Network Data Set)”. In: 2015 Military Communications and
Information Systems Conference (MilCIS). IEEE. 2015, pp. 1–6.
[57] Robin Nix and Jian Zhang. “Classification of Android Apps and Malware Using Deep Neural
Networks”. In: Proceedings of the International Joint Conference on Neural Networks 2017-
May (2017), pp. 1871–1878.
[58] Jannik Pewny et al. “Leveraging Semantic Signatures for Bug Search in Binary Programs”. In:
Proceedings of the 30th Annual Computer Security Applications Conference. 2014, pp. 406–
415.
[59] Anh Viet Phan, Minh Le Nguyen, and Lam Thu Bui. “Convolutional Neural Networks Over
Control Flow Graphs for Software Defect Prediction”. In: 2017 IEEE 29th International Con-
ference on Tools with Artificial Intelligence (ICTAI). IEEE. 2017, pp. 45–52.
[60] Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. “Mimicking Word Embeddings using
Subword RNNs”. In: arXiv preprint arXiv:1707.06961 (2017).
[61] Samira Pouyanfar et al. “A Survey on Deep Learning: Algorithms, Techniques, and Applica-
tions”. In: ACM Computing Surveys (CSUR) 51.5 (2018), pp. 1–36.
[62] Edward Raff et al. “Malware Detection by Eating a Whole exe”. In: Workshops at the Thirty-
Second AAAI Conference on Artificial Intelligence. 2018.
[63] Random Forest. Accessed: 2022-1-11. url: https://en.wikipedia.org/wiki/Random_fo
rest.
[64] Suhita Ray. “Disease Classification within Dermascopic Images using Features Extracted by
Resnet50 and Classification through Deep Forest”. In: arXiv preprint arXiv:1807.05711 (2018).
[65] Scikit-learn. INRIA, 2010. url: https://scikit-learn.org/stable/index.html.
[66] sendTextMessage. Accessed: 2022-1-26. url: https://developer.android.com/referen
ce/android/telephony/SmsManager#sendTextMessage(java.lang.String,%5C%20ja
va.lang.String,%5C%20java.lang.String,%5C%20android.app.PendingIntent,%5
C%20android.app.PendingIntent,%5C%20long).
[67] Eui Chul Richard Shin, Dawn Song, and Reza Moazzezi. “Recognizing Functions in Binaries
with Neural Networks”. In: 24th USENIX Security Symposium (USENIX Security 15). 2015,
pp. 611–626.
[68] Sophisticated new Android Malware Marks the Latest Evolution of Mobile Ransomware. mi-
crosoft.com, 2020. url: https://www.microsoft.com/security/blog/2020/10/08/sop
histicated-new-android-malware-marks-the-latest-evolution-of-mobile-rans
omware/.
[69] Stamina Scalable Deep Learning Whitepaper. Accessed: 2021-09-30. url: https://www.in
tel.com/content/dam/www/public/us/en/ai/documents/stamina-scalable-deep-
learning-whitepaper.pdf.
[70] Andrew Sung et al. “Static Analyzer of Vicious Executables (SAVE)”. In: ACSAC. 2005.
[71] The DWARF Debugging Standard. DWARF Standards Committee, 2012. url: https://dwar
fstd.org.
[72] Tobiyama et al. “Malware Detection with Deep Neural Network Using Process Behavior”. In:
International Computer Software and Applications Conference 2 (2016), pp. 577–582. issn:
07303157.
[73] Ubuntu Software Packages. https://packages.ubuntu.com/bionic/. Accessed: 2021-09-
30.
[74] UI/Application Exerciser Monkey in Android Studio. Android.com, 2008. url: https://dev
eloper.android.com/studio/test/monkey.html.
[75] Ashish Vaswani et al. “Attention Is All You Need”. In: Advances in Neural Information Pro-
cessing Systems. 2017, pp. 5998–6008.
[76] Virusshare. Accessed: 2021-09-30. url: https://virusshare.com/.
[77] VirusTotal. Accessed: 2019-09-30. url: https://www.virustotal.com/.
[78] Zhilong Wang et al. “Identifying Non-Control Security-Critical Data in Program Binaries with
a Deep Neural Model”. In: arXiv preprint arXiv:2108.12071 (2021).
[79] Zhilong Wang et al. “Spotting Silent Buffer Overflows in Execution Trace through Graph Neural
Network Assisted Data Flow Analysis”. In: arXiv preprint arXiv:2102.10452 (2021).
[80] What Is A Malware File Signature (And How Does It Work)? Accessed: 2021-09-30. url:
https://www.sentinelone.com/blog/what-is-a-malware-file-signature-and-ho
w-does-it-work/.
[81] Wikipedia. Locality-sensitive hashing — Wikipedia, The Free Encyclopedia. http://en.wik
ipedia.org/w/index.php?title=Locality-sensitive%20hashing&oldid=106294184
5.
[82] Michelle Y Wong and David Lie. “IntelliDroid: A Targeted Input Generator for the Dynamic
Analysis of Android Malware.” In: NDSS. 2016.
[83] Dong-Jie Wu et al. “Droidmat: Android Malware Detection through Manifest and Api Calls
Tracing”. In: 2012 Seventh Asia Joint Conference on Information Security. IEEE. 2012, pp. 62–
69.
[84] Wen-Chieh Wu and Shih-Hao Hung. “DroidDolphin: A Dynamic Android Malware Detection
Framework using Big Data and Machine Learning”. In: Proceedings of the 2014 Conference
on Research in Adaptive and Convergent Systems. 2014, pp. 247–252.
[85] XposedBridge. Rovo89, 2016. url: https://github.com/rovo89/Xposed%20Bridge/wik
i/Development-tutorial.
[86] Lifan Xu et al. “Hadm: Hybrid Analysis for Detection of Malware”. In: Proceedings of SAI
Intelligent Systems Conference. Springer. 2016, pp. 702–724.
[87] Wei Xu et al. “Largescale System Problem Detection by Mining Console Logs”. In: Proceedings
of SOSP’09 (2009).
[88] Xiaojun Xu et al. “Neural Network-based Graph Embedding for Cross-platform Binary Code
Similarity Detection”. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and
Communications Security. 2017, pp. 363–376.
[89] Ruipeng Yang et al. “NLSALog: An anomaly detection framework for log sequence in security
management”. In: IEEE Access 7 (2019), pp. 181152–181164.
[90] Tianda Yang et al. “Automated Detection and Analysis for Android Ransomware”. In: IEEE
7th International Symposium on Cyberspace Safety and Security. 2015.
[91] Donggeun Yoo and In So Kweon. “Learning Loss for Active Learning”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 93–102.
[92] Ilsun You and Kangbin Yim. “Malware Obfuscation Techniques: A Brief Survey”. In: 2010
International conference on broadband, wireless computing, communication and applications.
IEEE. 2010, pp. 297–300.
[93] Lun-Pin Yuan, Peng Liu, and Sencun Zhu. “Recomposition vs. Prediction: A Novel Anomaly
Detection for Discrete Events Based On Autoencoder”. In: arXiv preprint arXiv:2012.13972
(2020).
[94] Muhan Zhang et al. “An End-to-end Deep Learning Architecture for Graph Classification”. In:
Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[95] Yajin Zhou and Xuxian Jiang. “Dissecting Android Malware: Characterization and Evolution”.
In: 2012 IEEE Symposium on Security and Privacy. IEEE. 2012, pp. 95–109.
[96] Yajin Zhou et al. “Hey, You, Get off of My Market: Detecting Malicious Apps in Official and
Alternative Android Markets.” In: NDSS. 2012.
[97] Qingtian Zou et al. “Deep Learning for Detecting Network Attacks: An End-to-end Approach”.
In: IFIP Annual Conference on Data and Applications Security and Privacy. Springer. 2021,
pp. 221–234.