Malware Detection Using ANN
dynamic features
Keywords: Malware, Threat Detection, Neural Network, Static Features, Dynamic Features
Abstract: The cyber-security industry has been home to various machine learning approaches meant to be more proactive when it comes to new threats. In time, as security solutions matured, so did the way in which artificial intelligence algorithms are used for specific contexts. In particular, static and dynamic analysis of a threat determines certain characteristics of an artificial intelligence algorithm (such as inference speed and memory usage) used for threat detection. While from a product point of view static and dynamic analysis of a threat target separate product features (protection for static analysis and detection for dynamic analysis), the feature sets derived from analyzing threats in those two scenarios are complementary and could improve the accuracy of a model if used together. The current paper focuses on a multi-layered approach that takes into consideration both static and dynamic analysis of a threat.
1 INTRODUCTION
The cyber-security industry has evolved along with the constant increase of attacks that led to more than 1.2 billion[1] known malware samples in 2023. Nowadays, most cyber-security solutions rely on different machine learning algorithms for threat detection.
At the same time, cyber-security solutions evolved in an attempt to cover various needs from both consumer and business markets. One of the major differences in these cases is that while consumers are more focused on protection (the role of a security solution being to make sure that nothing bad happens to a system), the business solution also focuses on providing visibility around an attack. Most of the requirements that appear in the business market are a direct result of several compliance rules that enterprises have to obey. For example, if a cyber-security attack succeeds on a bank, the bank is required to start an internal investigation that analyzes logs and additional attack artifacts to better understand the impact of that attack.
With this, requirements emerge for a new type of machine learning algorithm that focuses on the behavior of an attack and uses logs or asynchronously obtained events to create the features used by the algorithm. These algorithms cannot block a threat but can provide late triggers about an ongoing attack. These triggers are often referred to as detection capabilities, as they do not provide any protection.
As a general observation, models designed for detection don't necessarily need to have a low false positive rate, as they cannot block anything. As such, they are usually allowed a certain level of false positives if the detection rate is increased. Another important observation is that the output of these models is usually analyzed by a SOC[2] team that determines if an attack was identified or not. This behavior led to a phenomenon called alert fatigue, meaning that alerts from these models accumulate to a point where they are hard for a company's security officers to sort and analyze.
Our paper focuses on a method that combines two types of models: models based on statically extracted features and models designed for detection, with the purpose of reducing the alert fatigue phenomenon while preserving a high detection rate. For this purpose we have used multiple malicious files that were analyzed from a dual perspective:
• a static analysis perspective, where we extract features based on meta information obtained from the malicious files
• a dynamic analysis perspective, where we extract features that reflect the malware behavior at runtime.
The rest of the paper is organized as follows: section 2 reviews similar research, section 3 describes the current cyber-security landscape and the problem we are tackling, section 4 presents our approach to building a neural network that uses both statically and dynamically extracted features, section 5 shows the databases used in our experiment and several results and finally, section 6 draws several conclusions on the practical aspects of our proposal.
[1] https://www.av-test.org/en/statistics/malware/
[2] Security Operation Center
2 RELATED WORK
Over the last decade, malware detection has been one field that attracted a lot of attention from researchers who used various machine learning methods. There are two main approaches that are usually employed when it comes to deciding what features will be used in the process of training and evaluating machine learning models.
Static
The first approach involves extracting static features from the files, without executing them. Thus, this approach usually consumes fewer resources and reaches high speed and potentially high accuracy.
(Ahmadi et al., 2016) proposed a malware family classification system, using a wide range of static features extracted from the original PE executable files that were not unpacked or deobfuscated. By combining the most relevant feature categories and feeding them to an XGBoost-based classifier, their model reached an accuracy of around 99.8% on the Microsoft Malware Challenge dataset (Ronen et al., 2018) of 20000 malware samples.
Other approaches demonstrated that minimal knowledge is needed for extracting relevant static features from executable files. For instance, a study demonstrated that effective malware detection can be obtained using the information in the first 300 bytes from the PE header of executable files as input (Raff et al., 2017b). The same year, a more comprehensive study (Raff et al., 2017a) presented MalConv, a deep convolutional neural network model which directly uses the raw byte representation of executables (limited to the first 2 MB) as input, without any intelligent identification of specialized structures or specific executable or malware content. The model showed good results, achieving 94% accuracy after training on a large dataset of 2 million PE files.
In a more recent study (Zhao et al., 2023), the authors researched a different method, converting the bytecode extracted from the original files into color images and using them as input features for training an AlexNet convolutional neural network (CNN). The results were promising, the accuracy of their model reaching more than 99% on two rather small public malware datasets from Google Code Jam and Microsoft of around 10000 samples.
However, using static features alone might bring some limitations in real-world malware detection scenarios where advanced obfuscation, packing or encryption is used for creating malicious files. A recent study (Aghakhani et al., 2020) investigated this aspect and, using a dataset of almost 400 thousand executables, demonstrated that static information alone isn't indicative of the actual behavior of the classified files and that a substantial number of false positives occur on packed benign files.
Dynamic
The other main approach is to extract dynamic features that describe the behavior of the malware during execution or partially retain information regarding that behavior.
One method is to include dynamic runtime opcodes as input features, allowing the behavior of executables to be captured. An extensive study (Carlin et al., 2019) showed that this approach can accurately detect malware, even on a continuously growing and updatable dataset that requires retraining. The authors compared 23 machine learning algorithms and concluded that their method worked best using the Random Forest model.
In a recent study (Zhang et al., 2023), the authors proposed another method that combines dynamic features based on API call sequences with the semantic information of functions, bringing more context to the action actually performed by the API call. Compared to existing similar experiments that only used API call information, their solution shows improvements of 3% to 5% in detection accuracy.
Static + dynamic separately
(Ijaz et al., 2019) compare several methods based on machine learning for detecting Windows OS executables. They use a small set of files of only 39000 malicious binaries and 10000 benign ones, from which they statically extract a small set of 92 features from the PE headers using the PEFILE tool. They also dynamically extract 2300 features from a small part of the files, from their execution in Cuckoo Sandbox. Their detection measurements are made using either the static features or the dynamic features separately. Also, using a sandbox for training and evaluation with the dynamic features brings a series of disadvantages, because it does not provide a form of real-time protection for the new malicious files that would need to be evaluated.
Hybrid
In one of the first such approaches, (Santos et al., 2013) present a hybrid malware detection system that combines both static and dynamic features. The small dataset they use consists of 1000 malicious programs and 1000 legitimate executables, from which they extract two-byte opcodes, perform feature selection using Information Gain and select the first 1000 as the static features that will be used. The dynamic characteristics are extracted by monitoring the behavior of the programs in a controlled sandboxed environment.
Another hybrid approach that uses both static and dynamic features for malware detection is proposed in (Zhou, 2019). The authors use a sandbox for recording API call sequences from the execution of 90000 malicious and benign files and they extract dynamic features out of a trained RNN model that is fed with the recorded sequences. The static and dynamic features are then combined into custom images that are used in the training and validation phase of a CNN model. Both studies demonstrate how combining static and dynamic characteristics as input features for their models brings improvements in detection rates.
Compared to our approach, these two studies have a few limitations, one related to the small number of files used in the dataset and another related to the usage of a sandbox for extracting dynamic features during execution. Thus, their systems could provide only offline detection and classification mechanisms and are not practical solutions for a product which must provide real-time protection against malware.
3 PROBLEM/SECURITY LANDSCAPE
The cat-and-mouse game has been a constant of the cyber-security ecosystem for decades: malicious actors create a new threat, cyber-security solutions adapt, then the new threats adapt to the new cyber-security changes and the cycle goes on. And while this type of change is inevitable, there were other (more business-related) changes that security products underwent over the years.
One such important change was the split between types of users: enterprise and consumer. While consumer users are more interested in protection (the security solution is perceived as a tool that quietly ensures that everything is secured), the business environment comes with several different challenges. When a breach happens, there is a need (sometimes driven by compliance regulations) to understand exactly what endpoints were affected, what kind of data was exfiltrated, when the attack started or what set of measures would reduce the chance of a similar attack happening in the future.
While these differences relate mostly to product features (centralized dashboard and reports for enterprise environments, automated flows for consumers), there are several differences that regard threat detection as well. As a result, threat detection differences can be classified in regards to:
• static detection - usually associated with pre-execution / on-access scanners. The main characteristic of this type of detection is that it takes a file as an input (but a file that was not executed yet) and analyzes it (in terms of its content). Security products refer to this type of detection as protection, as that file is not executed yet and detecting it at this point blocks the attack and keeps the user protected. This is heavily used in consumer products where the expectation is to block everything and keep the user protected.
• dynamic detection - usually associated with post-execution scanners. It implies that the file is allowed to run, while at the same time its actions are monitored. This type of detection is better at identifying behavior and intent, but it's less resilient in terms of protection (once some data is copied to an external site, even if we record the event, we cannot un-copy it). Enterprise solutions use this method as part of EDR/XDR products to record data related to an attack and automatically create a root cause report.
From a detection point of view (and in particular, if we refer to machine learning models) there are several distinct features that each of these two detection methods (static and dynamic) have:
• static detection methods are usually used in the pre-execution phase. This means that, for example, before a file gets executed, its content is scanned. From a technical perspective, this is achieved via a kernel mode driver that stops the execution until the result from the scanner is available. While this method ensures protection (nothing gets executed unless it was scanned), it also imposes certain limitations. If, for example, the duration of a scan is one second for each file, the entire operating system will be heavily slowed down. As such, models that are used in this phase have to be fast (fewer neurons, or more classical machine learning approaches such as binary decision trees, random forests, etc). It's also important to notice that features used in these methods are extracted directly from the file content (strings, section information, disassembly listings, imports and exports, etc) and don't reflect the behavior of a sample but rather a probability of something being malicious.
• dynamic detection on the other hand is used with events that are recorded asynchronously. As such, the performance impact is reduced and, assuming storage space is not an issue, larger models (e.g. neural networks with multiple hidden layers) can be used. It is also worth mentioning that the input
                                                Static      Dynamic
                                                Detection   Detection
Susceptible to packers and                      Yes         No
  obfuscation techniques
Behavior false positive                         No          Yes
  (use of startup registry keys)
Susceptible to dynamic detection                No          Yes
  evasion techniques
  (behave differently if monitored)
Table 1: Static vs dynamic detection
5 RESULTS
5.1 Database
We used a private dataset containing 5455942 PE files, of which 4068535 are labeled benign and 1387407 are labeled malicious. The dataset contains roughly 75% benign files to reflect that, on a typical computer, there are many more benign files than malicious ones. As such, this ratio helps us avoid a model prone to false positive alerts.
The static features representing the PE file structure were extracted using a component of an AntiMalware solution. Examples of static features are found in table 1.
More than 150000 static features were extracted for each file. Much like in (Dahl et al., 2013), we had to reduce the number of features used by the model based on static features. The total uncompressed size of the extracted static data was 5.5TB. In order to perform feature selection on such a large quantity of data in a reasonable amount of time, we had to do the selection in multiple steps. These consecutive steps are described by the function Select.
A = all_features(D);
do
    n = largest n s.t. n * len(A) < limit;
    d = n_random_instances_with_features(D, n, A);
    k = max(c, len(A) / 2);
    A = {};
    for select_k_best in S do
        A = A ∪ select_k_best(d, k);
    end
while len(A) > c;
Algorithm 1: Progressively select the best features of the dataset
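The loop in Algorithm 1 can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the in-memory list-of-rows dataset, the scoring function `mean_gap_scores`, the names `progressive_select` and `select_k_best`, and the `max_rounds` safety guard are all assumptions introduced for illustration. The sketch only mirrors the structure of the algorithm: sample as many instances as the budget `limit` allows for the current feature set, keep the `k = max(c, len(A)/2)` best features according to each selector in `S`, and repeat until at most `c` features remain.

```python
import random


def mean_gap_scores(rows, labels, features):
    """Hypothetical scorer: |mean(feature | malicious) - mean(feature | benign)|."""
    scores = {}
    for f in features:
        pos = [r[f] for r, y in zip(rows, labels) if y == 1]
        neg = [r[f] for r, y in zip(rows, labels) if y == 0]
        if not pos or not neg:
            scores[f] = 0.0  # cannot compare classes on this sample
        else:
            scores[f] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return scores


def select_k_best(rows, labels, features, k):
    """Return the k highest-scoring features (ties broken by feature index)."""
    scores = mean_gap_scores(rows, labels, features)
    ranked = sorted(features, key=lambda f: (-scores[f], f))
    return set(ranked[:k])


def progressive_select(rows, labels, c, limit, selectors, seed=0, max_rounds=64):
    """Shrink the feature set step by step until at most c features remain.

    rows      -- list of equally long numeric feature vectors (stands in for D)
    c         -- target number of features
    limit     -- budget on n * len(A), i.e. values loaded per round
    selectors -- list of select-k-best callables (stands in for S)
    max_rounds is an added safeguard, not part of Algorithm 1.
    """
    rng = random.Random(seed)
    active = set(range(len(rows[0])))  # A = all_features(D)
    for _ in range(max_rounds):
        # largest n such that n * len(A) < limit, capped by the dataset size
        n = min(len(rows), max(1, (limit - 1) // len(active)))
        idx = rng.sample(range(len(rows)), n)
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        k = max(c, len(active) // 2)
        candidates = sorted(active)
        active = set()  # A = {}
        for selector in selectors:
            active |= selector(sample_rows, sample_labels, candidates, k)
        if len(active) <= c:  # do ... while len(A) > c
            break
    return sorted(active)
```

Because the feature set at least halves in each round while more than `2c` features remain, the number of sampled instances can roughly double per round under the same budget, so later rounds rank fewer features on more data. With several selectors the union may stay larger than `c`, which is why the sketch adds the `max_rounds` guard.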
Metric Value
FP Rate 0.368%
TP Rate 86.746%
F1 Score 92.234%
Accuracy 96.346%
Table 4: Performance of model using static features
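For reference, the four metrics in Table 4 are all derived from the same confusion matrix. The sketch below shows the standard definitions, treating the malicious class as positive. The function name `detection_metrics` and the counts in the usage line are hypothetical and are not the counts behind Table 4.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics, malicious class = positive."""
    tp_rate = tp / (tp + fn)            # detection rate (recall)
    fp_rate = fp / (fp + tn)            # share of benign files flagged
    precision = tp / (tp + fp)
    f1 = 2 * precision * tp_rate / (precision + tp_rate)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "fp_rate": fp_rate,
        "tp_rate": tp_rate,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts, chosen only to illustrate the formulas.
example = detection_metrics(tp=90, fp=5, tn=895, fn=10)
```

On an imbalanced dataset with many more benign files than malicious ones, accuracy is dominated by the benign class, which is why the FP rate and TP rate are reported alongside it.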
REFERENCES
Aghakhani, H., Gritti, F., Mecca, F., Lindorfer, M., Ortolani, S., Balzarotti, D., Vigna, G., and Kruegel, C. (2020). When malware is packin' heat; limits of machine learning classifiers based on static analysis features. Proceedings 2020 Network and Distributed System Security Symposium.
Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., and Giacinto, G. (2016). Novel feature extraction, selection and fusion for effective malware family classification.
Carlin, D., Okane, P., and Sezer, S. (2019). A cost analysis of machine learning using dynamic runtime opcodes for malware detection. Computers & Security, 85.
Dahl, G. E., Stokes, J. W., Deng, L., and Yu, D. (2013).
Large-scale malware classification using random pro-