RESEARCH

Learning from sandbox artifacts using NLP: A novel approach for botnet detection

Muhammad Qasim1*, Renos Avraam1,2, Muhammad Salman and Jens

*Correspondence: [email protected]
Full list of author information is available at the end of the article

Abstract

In modern times, the Internet has become a lifeline for our daily activities. With this growing reliance, however, comes an alarming exposure to a wide range of cyber threats. Botnet detection has traditionally relied on signature-based techniques that struggle to keep pace with the ever-evolving threat landscape. In this study, we propose a machine-learning method for identifying botnets in which natural language processing (NLP) methodologies extract crucial insights from sandbox-generated reports, and the resulting datasets are used to train machine learning algorithms. Our approach, using the XGBoost classifier, achieves a botnet detection accuracy of 99.17% and a ROC/AUC score of 0.9995. These findings demonstrate the potential of NLP for extracting meaningful features and the effectiveness of machine learning algorithms in detecting botnet malware.

Keywords: botnets; sandbox; machine learning; datasets; NLP; BERT; Bag of Words; Word2Vec; GloVe

1 Introduction
The integration of the Internet into our daily lives has opened up a new era of op-
portunities, products, and access to vast amounts of data. However, as our reliance
on the Internet grows, so does the threat it poses to our security [1] [2] [3]. This
increased connectivity has led to the exploitation of cyber security vulnerabilities,
resulting in various threats such as phishing, malware, ransomware, social engineer-
ing, identity theft, and denial-of-service attacks. One particularly concerning threat
in the current era is the botnet attack [4]. A botnet is a network
of compromised machines, controlled by one or multiple remote servers known as
command and control servers. These infected machines often remain inconspicuous,
operating normally until they receive a command to launch an attack. There are
various types of botnets, but they all function in a similar fashion [5]. For exam-
ple, the Zeus botnet [6], also known as Zbot, specifically targets financial market
data, resulting in significant global financial losses. Another example is the Windigo
botnet [7], which emerged in 2011, targeting approximately 10,000 Linux servers and
producing a remarkable 35 million spam emails. The Mirai botnet [8] surfaced in
2016, targeting high-end embedded and IoT devices and carrying out large-scale
distributed denial-of-service (DDoS) attacks. These examples highlight the immense
scale and impact of botnets, emphasizing the critical need for effective measures to
detect and mitigate their activities.
Botnet detection is challenging due to the evolving nature of these attacks. Techniques
for botnet detection [9] can be broadly categorized into two main types: active tech-
niques and passive techniques. Passive techniques [10] primarily rely on monitoring
data collected from sources such as honeypots, packet inspection, and analysis of
spam records. On the other hand, active techniques involve direct communication
with information sources, often employing deeper probing and analysis. One ex-
ample of an active detection technique is the sinkhole method, which disrupts the
communication between the bots and the command and control server by severing
their connection. Another active technique is DNS cache spoofing, which involves
manipulating DNS cache entries to redirect botnet traffic toward detection systems.
Tracking flux networks is yet another active technique used for botnet detection. De-
spite the continuous advancements in botnet detection, the increasing complexity of
botnet threats has rendered conventional methods ineffective [11]. As botnets evolve
and employ sophisticated techniques, there is a need for innovative and adaptive
approaches to effectively identify and mitigate these threats.
Effective detection and prevention of botnets play a vital role in safeguarding the
security and integrity of computer systems. To assess the effectiveness of machine
learning in the field of cybersecurity, it is necessary to apply machine learning tech-
niques to existing datasets or generate new datasets. There are multiple approaches
for generating datasets that can be used for botnet detection. One such approach is
the utilization of the EMBER dataset [12], which involves extracting static features
from botnet samples. This dataset provides valuable insights into the characteristics
and behaviors of botnets, enabling the development of machine-learning models for
accurate detection. Another approach is the use of the CTU-13 dataset [13], where
bots are deliberately deployed and analyzed in a controlled laboratory environment.
By analyzing the network traffic generated by these bots, meaningful features can be
derived. This dataset allows researchers to study the real-world behavior of botnets
and develop machine learning algorithms to effectively detect and mitigate their
activities. By leveraging these datasets and applying machine learning techniques,
researchers can explore and evaluate the potential of machine learning in addressing
the challenges of botnet detection and prevention, ultimately enhancing the security
of computer systems.
This paper addresses three main limitations in botnet attack detection using ma-
chine learning. Firstly, there is a lack of a comprehensive botnet ground truth
dataset that covers modern attack techniques. Existing research has relied on
outdated and unbalanced datasets, often with a limited number of botnet types
[divekar2018benchmarking] [song2006description] [shiravi2012toward]. This lack of
benchmark datasets hinders the ability to compare different machine learning mod-
els and features, resulting in fragmented research and a lack of clarity regarding the
effectiveness of proposed detection models [koroniotis2019towards].
Secondly, the proposed machine learning models often struggle to withstand the
evasive strategies employed by hackers. Botnets continuously evolve and adapt to
avoid detection, rendering some detection models ineffective. This poses a significant
challenge in developing robust machine-learning models that can effectively detect
botnet attacks and overcome evasion techniques.
Thirdly, the current feature extraction techniques predominantly rely on static
features and do not capture the complete profiles of botnet behaviors. This limited
feature coverage may lead to missed detections and insufficient understanding of
the intricate characteristics and dynamics of botnets.
While classical machine learning and deep learning models have been utilized
for botnet detection [popoola2021smote] [saeed2022survey] [miller2016role], the ap-
plication of Natural Language Processing (NLP) has opened up new possibilities
for malware detection systems. ML-based text analyzers, leveraging NLP tech-
niques, have shown promise in enhancing the detection capabilities of botnet attacks
[8233569] [zhang2021hybrid] [lu2019malware] [mimura2022applying]. Addressing
these limitations and exploring the potential of NLP-based approaches can con-
tribute to the development of more effective and robust botnet detection systems,
enhancing the overall security of computer networks.
In this research, we propose an alternative approach to the conventional feature
selection method for detecting botnet samples from benign samples. The raw data is
generated by sandboxing real-world botnet samples and generating reports. Our aim
is to explore further methods for extracting features from these reports, particularly
those leveraging NLP approaches, which focus on the processing of human language
[eisenstein2019introduction].
The objective of our research is to create a dataset by extracting features from
sandbox-generated reports, identify NLP approaches applicable to this scenario,
and train machine learning algorithms for botnet detection. Additionally, we aim
to evaluate the viability of employing NLP for this task. To accomplish this, we
examine the performance of several machine learning algorithms on dataset AUBDS-02,
while also investigating ways to increase the robustness of our models.
This study contributes to addressing the difficulties in botnet detection by propos-
ing novel techniques that leverage NLP and sandbox-generated reports. By focusing
on feature extraction, dataset creation, and machine learning algorithm training,
we aim to advance the field of botnet detection and enhance the effectiveness of
detection systems.

1.1 Contributions
Major contributions of our research are listed below:
1 Release of a ground-truth botnet dataset. We release a new, real-world botnet
dataset that includes both legitimate and botnet samples, with 65.75% malicious
and 34.24% benign samples, drawn from the past three years and 12 botnet
families. The data is collected from multiple industry sources, including
VirusTotal and abuse.ch. Ground truth is ensured by collecting all malicious
samples from the databases of well-known cybersecurity companies. The dataset
is generated by post-processing malware analysis reports and is made available
to security researchers.
2 ML-based botnet detection analysis. Many suitable algorithms for botnet de-
tection are evaluated for performance utilizing a variety of methods, such as
supervised text classification, deep learning, one-class learning, and positive-unlabeled
learning. The latter two methods have never been used in the context of botnet
detection before. We discovered that one-class learning strategies did not perform
well, with the Bag of Words strategy achieving the highest F1 score of 45%. We also
evaluate a large number of AI classifiers (both shallow and deep models) with NLP
feature sets, whereas other researchers have used few ML models with limited features.
3 Feature engineering and generation of datasets AUBDS-02 and AUBDS-03
using Bag of Words and BERT.
4 Algorithmic analysis of dataset and evaluations.

2 Literature Review
In this section, we present a comprehensive literature review of machine learning
for botnet detection. Each technique presented in the literature shows impressive
efficacy in combating this pervasive threat. However, a direct comparison of our
study's findings with others may not always be feasible due to variations in the
methodologies employed.
Diverse approaches exist within the realm of botnet detection. Some focus on de-
tecting botnets through the analysis of network traffic, while our primary objective
lies in identifying malicious bots themselves. As such, comparing our results directly
with studies adopting different methodologies might yield contrasting outcomes.
To facilitate a quick and comprehensive overview, we present a summarized version
of Table 1 at the conclusion of this section. This summary enables us to readily
compare our methods with those employed by other researchers, shedding light on
the distinctiveness and potential contributions of our approach to the broader field
of botnet detection.
Extensive research was conducted by U. Pedrero et al. [14], wherein a complete
one-day feed from public and private sources was dissected. The focus was on ex-
amining Windows executables in detail and comparing the outcomes generated by
static and dynamic analysis tools. Clustering techniques were employed to identify
similar sandbox findings and classify them based on different priorities. The study
aimed to evaluate the computational resources and human effort required by orga-
nizations to comprehend daily malware feed, while also addressing the challenges
encountered during sample feed analysis. DBot [15] introduced a technique for iden-
tifying and grouping botnets that communicate with their command and control
(CC) servers using domain generation algorithms (DGAs). By monitoring DNS traf-
fic, the research aimed to detect patterns commonly associated with DGA-based
botnets. The underlying assumption of the DBot approach was that the majority
of domains generated by the DGA do not possess valid IP addresses, resulting in
an NXDomain response. The study reported an average accuracy of 99.69% across
four distinct datasets under controlled conditions. However, it is important to note
that real-time scenarios revealed limitations, as bot activity does not always align
with the collection of DNS messages required for accurate identification. In a sep-
arate research effort [16], a methodology based on word embedding and the Long
Short-Term Memory (LSTM) network model was proposed for packet-based ma-
licious traffic classification. The evaluation demonstrated comparable performance
to previous efforts in categorizing flows as benign or malicious, while significantly
reducing the time required for flow pre-processing. The main advantage of this
proposed methodology is its ability to accelerate detection without the need for
pre-processing packets into flows.
Jianjing Cui et al. [17] present a network intrusion detection model based on
word embedding. The author successfully utilizes word embedding to reduce the
dimensionality of features and employs deep learning to automatically extract rele-
vant features. The Word2Vec technique is employed to extract text sequences from
raw network traffic, allowing for the extraction of essential features. Deep learning
techniques process these features and word vectors to identify patterns of mali-
cious traffic. The author claims to achieve high detection accuracy with minimal
false positives. In the work by Rajvardhan Oak et al. [18], machine learning algo-
rithms are investigated for malware classification. The authors employ a supervised
LSTM-based strategy to classify malware. Using a substantial dataset of 183,000
samples, with two-thirds of them being malware, they achieve an impressive F1
score of 0.985. Similarly, BERT, a natural language-based model, is utilized with
imbalanced datasets, attaining an F1 score of 0.919.
In another research [19], the authors evaluate their approach using two existing
datasets: ISOT [20] and ISCX [21]. The ISOT dataset is generated by capturing
network traffic from honeypots, while the ISCX dataset simulates legitimate and
malicious network traffic to mimic real-world scenarios. The authors extract 29
features from the traffic, which are then reduced to 10 using a feature reduction
algorithm. The study achieves an accuracy of 99.2%, a false positive rate of 0.75%,
and an average detection rate of 99.08%.
In their research, Ali Feizollah et al. [22] focused on the detection of mobile bot-
nets using machine learning classifiers. They extracted features from network traf-
fic and applied various classifiers for detection purposes. The study compared the
performance of Naive Bayes, k-nearest neighbor, decision trees, multi-layer percep-
trons, and support vector machines. Experimental results showed that the k-nearest
neighbor (KNN) classifier achieved the highest true positive rate of 99.94% and a
false positive rate of 0.06%. The UMUDGA dataset [23] was developed specifically
for detecting automatically generated domain names used in botnet identification.
During the feature engineering phase, NLP and nGrams techniques were employed
for feature extraction. The dataset was analyzed using different machine learning
algorithms, with the random forest algorithm achieving the best results. M. Wo-
jnowicz et al. [24] conducted a study on entropy analysis of malicious files. They
utilized wavelet decomposition to obtain the entropic wavelet energy spectrum for
the entropy stream of each file. The wavelet decomposition of software entropy
proved useful for predicting malware by capturing suspicious patterns in portable
executable files. The study demonstrated that wavelet decomposition-derived fea-
tures enhanced the performance of a large-scale parasitic malware detection task.
Furthermore, a classifier based solely on three types of features (strings, entropy,
wavelet) yielded promising predictive performance.

3 Methodology
The proposed methodology encompasses several steps, including data collection,
feature extraction, application of machine learning algorithms, and evaluation of
results. A flow chart (Figure 1) illustrates the sequence of these processes, which
will be discussed in detail in the subsequent sections.

Table 1 Summary of features of datasets and generation techniques

Name | Datasets | ML Approach | Features | Limitations | Accuracy
DBot [15] | ISCX2012 [25] | - | - | - | 99.69%
R.-H et al. [16] | ISCX2012, USTC-TFC2016, IoT datasets | word embedding, LSTM | raw data, packet based | - | 99.99%
J. Cui et al. [17] | ISCX2012 [25] | word embedding, DL | process packet, flow-based | passive detection | 99.99%
R. Oak et al. [18] | - | BERT, LSTM | - | passive detection | 99.10%
G. Alauthaman et al. [26] | ISCX2012 [25], ISOT [27] | neural networks | packet header | P2P botnets, TCP traffic, outdated | 99.20%
A. Feizollah et al. [22] | Android Malware Genome Project [28] | ML classifiers | network features | Android apps only | 99.94%
M. Zago [23] | UMUDGA [23] | random forest | nGrams, NLP | specific botnets | 98.90%
M. Wojnowicz et al. [24] | Cylance repository [29] | logistic regression | strings, entropy statistics | static features | 98.90%
Proposed Approach | AUBDS-02 | ML classifiers | BERT model, bag of words | sandbox evasion | 99.40%
The first step in the methodology is data collection, where relevant datasets are
acquired for the botnet detection task. These datasets may consist of network traffic
data, malware samples, or reports generated by sandboxing techniques.
Next, the feature extraction process is employed to extract meaningful information
from the collected data. This involves techniques such as NLP, wavelet decomposi-
tion, or other feature engineering methods to identify and extract relevant features
that can distinguish between botnet and benign samples.
Once the features are extracted, machine learning algorithms are applied to train
models for botnet detection. Various algorithms, including Naive Bayes, k-nearest
neighbor, decision trees, multi-layer perceptrons, support vector machines, and ran-
dom forest, may be utilized based on their suitability and performance for the given
task.
After training the models, the evaluation phase assesses their performance. This
includes measuring metrics such as true positive rate, false positive rate, accuracy,
F1 score, and other relevant evaluation criteria. The results are analyzed to deter-
mine the effectiveness and efficiency of the proposed methodology.
Throughout the discussion, accompanying figures, including the flow chart, pro-
vide a visual representation of the proposed methodology, illustrating the sequence
of steps and processes involved in the detection of botnets using machine learning
techniques.

3.1 Data Generation


During this phase, the focus is on collecting samples in their unprocessed origi-
nal form and generating reports. In the context of botnet detection, the goal is to
gather raw data or samples that accurately represent botnet activity. This collection
process can involve various sources, such as capturing network traffic from honeypots,
accessing malware samples from repositories, or employing other techniques
to obtain relevant data.

Figure 1 Overview and flow diagram of complete processes
The collected samples are typically kept in their original, unprocessed state to
preserve their inherent characteristics. This ensures that the data reflects the true
nature of botnet behavior and enables more accurate analysis and detection. In
addition to collecting the samples, reports are generated to provide structured in-
formation about each collected sample. These reports contain detailed insights into
the behavior of the samples, their communication patterns, identified malicious ac-
tivities, and any other pertinent information that aids in comprehending the nature
of the botnets under investigation.
The generated reports serve as valuable resources for subsequent analysis and
feature extraction. They offer a comprehensive overview of the collected samples,
facilitating further investigation and the identification of distinctive features that
can enhance botnet detection methods. Overall, this phase encompasses the collec-
tion of unprocessed samples and the generation of reports, ensuring the availability
of both raw data and structured information crucial for the subsequent steps of the
botnet detection process.
Together, the unprocessed samples and the generated reports give the botnet
detection process access to both raw data and structured information. The raw
data, in its unprocessed form, provides a true representation of botnet activity and
retains the original characteristics of the samples, which is essential for in-depth
analysis and feature extraction. The generated reports, in turn, offer structured
insights into the behavior, communication patterns, and other relevant details of
the collected samples, serving as a consolidated and organized resource for further
analysis. This combination enables the subsequent steps of the detection process,
such as feature extraction, machine learning, and evaluation of results, to be
conducted effectively and to accurately identify and classify botnet activity.

3.1.1 Sample Collection


By searching for terms like "bots", "botnets", and other related terms on GitHub
repositories, Twitter, Google, and the wider Internet, we undertook an extensive
survey to find, gather, and compile publicly available botnet samples. Collecting
botnet samples in the real world is difficult and potentially dangerous, so numerous
security firms, such as VirusTotal [30], abuse.ch [31], and ANY.RUN [32], were also
contacted. A total of 5,231 samples were collected from all these sources. To ensure
that all of the samples were actually botnet-related, samples were submitted to
VirusTotal and AVclass [33]. The 2,724 benign samples were gathered from various
system tools, Windows OS accessories, and so on.

3.1.2 Sandbox Analysis


To analyze the collected samples, Cuckoo [34], an open-source sandbox framework,
was used to create a virtual environment. The sandbox automatically executes each
sample in a virtual machine after it is uploaded and records its behavior during
the analysis, such as system changes, network activity, and file alterations. Before
a new sample is submitted, the virtual machine is reset to a clean state so that each
piece of malware, or benign file, operates individually in a safe environment. After
completing the analysis, the sandbox generates a report in JSON format describing
the sample's activity cycle. The report contains a wide range of data, such as system
calls, network traffic, and registry changes. The following are the results of this
process:
1 Malicious: 5,231 Reports.
2 Benign: 2,724 Reports.
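For illustration, this submission workflow can be driven through Cuckoo's REST
API. The sketch below is a minimal example, assuming a local API server on
Cuckoo's default port 8090; the polling interval and file paths are illustrative, not
values taken from our setup.

```python
# Minimal sketch: submit a sample to Cuckoo's REST API and fetch the JSON
# report. Host, port, and polling interval are assumptions about a default
# local deployment.
import time
import requests

API = "http://localhost:8090"

def analyze(sample_path):
    # Queue the sample for execution in a clean virtual machine.
    with open(sample_path, "rb") as f:
        task_id = requests.post(f"{API}/tasks/create/file",
                                files={"file": (sample_path, f)}).json()["task_id"]
    # Wait until the analysis has been fully processed and reported.
    while requests.get(f"{API}/tasks/view/{task_id}").json()["task"]["status"] != "reported":
        time.sleep(10)
    # Download the JSON report containing system calls, traffic, registry changes.
    return requests.get(f"{API}/tasks/report/{task_id}").json()
```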

3.1.3 Reports Examination


In this section, we will go over the various elements in the reports, briefly describe
what information they provide (because there is no clear documentation regarding
the reports generated by the sandbox), and determine whether or not the data is
beneficial to us. Each of the subsections that follow will represent a single item
from the report. A few items that appear only rarely in reports are not listed in the
coming sections, but because they may contain essential information, we have not
discarded any of them. A detailed dissection and its importance for NLP models
are described in Table 2.
Table 2 JSON report attributes that are intended for NLP models

Title | Representation | Criticality | NLP contribution
info | sandbox platform, analysis information | nil | no
network | tcp/udp, dns, domains | bad IPs/URLs | yes
static | static and embedded properties | entropy | yes
behavior | system calls, files, registry keys | file modifications | yes
signatures | pattern match | malicious patterns | yes
strings | extracted strings from binaries | strings match | yes
debug | error & analysis logs | nil | no
metadata | name, hash | nil | no

3.1.4 Data Cleanse


We need to perform data cleaning on the collected reports, including the removal
of defective or duplicate reports to avoid potential issues. We use Python's json
module [35] to check whether any reports are malformed and to parse them. During
the examination of multiple reports, it was identified that some lacked essential
information and that the corresponding samples did not function properly, resulting
in sandbox exceptions. Furthermore, analysis reports with a duration of less than
2 seconds were reviewed and excluded. A minimal sketch of this cleaning pass is
given below.
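The sketch assumes a typical Cuckoo report layout; the directory name and the
exact JSON fields holding the analysis duration and sample hash are assumptions,
not values stated in this paper.

```python
# Sketch of the report-cleaning pass: drop malformed JSON, short analyses,
# and duplicates, then strip sections irrelevant to classification.
import json
import pathlib

kept, seen_hashes = [], set()
for path in pathlib.Path("reports").glob("*.json"):
    try:
        report = json.loads(path.read_text())
    except json.JSONDecodeError:
        continue                                        # drop malformed reports
    if report.get("info", {}).get("duration", 0) < 2:
        continue                                        # drop analyses under 2 s
    sha256 = report.get("target", {}).get("file", {}).get("sha256")
    if sha256 in seen_hashes:
        continue                                        # drop duplicate samples
    seen_hashes.add(sha256)
    for section in ("info", "debug", "metadata"):
        report.pop(section, None)                       # remove irrelevant sections
    kept.append(report)
```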
As mentioned earlier, certain report content was irrelevant for classification, as it
only provided information about the report itself rather than the sample or its
behavior. Therefore, the 'info', 'debug', and 'metadata' sections were completely
removed from all reports. The data cleansing process resulted in a collection of
reports that are in scope, free from malformation, contain accurate malware
information, and have no duplicates. Although the dataset has not been significantly
reduced, we still have sufficient samples to complete the task. The final dataset
includes the following entries:

Table 3 Number of malicious and benign samples after data cleanse

Ser. | Type of samples | Number
a. | malicious | 5,027
b. | benign | 2,652
 | total | 7,679

3.2 Feature Extraction


At this stage, we extract features from the data into structured form before training
and testing the machine learning models. Feature extraction [36] refers to the
selection and extraction of valuable qualities from raw data with the help of
algorithms or mathematical techniques; its purpose is to reduce the dimensionality
of the original data and to collect the most important information in a meaningful
fashion. We convert the list of words in each report into a feature vector using the
feature extraction algorithms shown in Figure 2, including syntactic and semantic
features, to evaluate their impact on classifier accuracy.

Figure 2 Classification of several feature extraction techniques

3.2.1 Syntactic / non-semantic count based vector space model


To transform our raw textual data into a numerical representation, we employ the
Bag-of-Words (non-semantic) technique. This approach, rooted in Natural Language
Processing (NLP), treats text as a collection or "bag" of individual words or tokens,
disregarding word order or context. First, we read the file and determine its class.
Then, we remove symbols and punctuation marks, such as commas, parentheses,
and quotation marks, from the text, merging all the words into a single cohesive
phrase. Since our focus is not on contextual meaning but rather on feeding the
words to the algorithm, this consolidation suffices. For implementation, we utilize
the scikit-learn CountVectorizer(), which incorporates a built-in English stop-word
list to remove unnecessary terms. We set a parameter to include only the top
1000 most frequent features. The resulting token count matrix is then organized
into a data frame, along with the previously stored labels, and saved as a CSV file
for future use.
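A minimal sketch of this step with scikit-learn follows; the two toy report strings
and the output file name are illustrative stand-ins for the flattened reports.

```python
# Sketch of the Bag-of-Words featurization with scikit-learn's CountVectorizer.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the flattened report texts and their labels (1 = malicious).
texts = ["network tcp dns suspicious domain contacted",
         "registry key read file opened benign behavior"]
labels = [1, 0]

# Built-in English stop-word list; keep only the 1000 most frequent tokens.
vectorizer = CountVectorizer(stop_words="english", max_features=1000)
counts = vectorizer.fit_transform(texts)

df = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
df["label"] = labels
df.to_csv("aubds_bow.csv", index=False)
```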

3.2.2 Word embedding/ Semantic


To address the challenge of capturing semantic content, we employ word embed-
dings [37]. Word embedding enables us to generate semantic feature vectors for
words, capturing their meaning and contextual information. This approach allows
us to move beyond syntactic representation and delve into the deeper semantic un-
derstanding of words. By utilizing word embedding techniques, we aim to enhance
the interpretation and analysis of the collected data.
1 Context-independent vector space model. To generate word embeddings, we
use both static and dynamic models. For static embeddings, we employ
the context-independent vector space model (traditional word embeddings)
using techniques such as Word2Vec [38] and the Gensim library [39]. The
Word2Vec model utilizes pre-trained word embeddings with 300-dimensional
vectors that are retained statically during training. On the other hand, the
dynamic technique modifies the vectors during the training phase. Addition-
ally, we utilize pre-trained GloVe embeddings [40] from the Stanford NLP
group and fastText embeddings [41] using the fastText Python library. These
approaches enable us to capture the semantic representation of words in our
analysis.
2 Context-dependent vector space model. To overcome the limitations of the
conventional vector space model, we adopt a context-dependent vector space
model that leverages contextualized word representations. We employ two
powerful contextualized word embeddings: Embedding from Language Model
(ELMo) [42] and Bidirectional Encoder Representations from Transformers
(BERT) [43]. BERT, developed by Google in 2018, has emerged as a cutting-
edge machine learning model for various natural language processing tasks
such as sentiment analysis, information retrieval, and sentence classification
[44]. We create an implementation environment with specialized hardware,
including an AMD 3700x CPU, RTX 2080s GPU, and 32GB RAM. We load
Torch 1.8.1 [45], CUDA toolkit 11.2 [46], and cudnn 8.1 [47] to optimize per-
formance.
In our approach, we utilize BERT to convert natural language sentences into
vector representations. Since context plays a crucial role in understanding the
input, we specifically apply BERT to the ’description’ column in the signature
row of reports. To prepare the reports for input, we process them by extracting
the relevant fields and combining them into a single long string. To effectively
tokenize the input for BERT, we employ the BERT tokenizer from the trans-
formers library [48]. The tokenizer splits the text into tokens and adds special
tokens like [CLS] (classification) and [SEP] (separator) to mark the start and
end of the input sequence. Additionally, [PAD] tokens are inserted for padding
when necessary. The tokenization step produces a dictionary containing the
input ids, which are used to index a pre-trained matrix of word embeddings.
This indexing allows us to obtain the corresponding word embeddings for the
input text. Furthermore, an attention mask is created to indicate the presence
of [PAD] tokens and facilitate proper attention mechanism within the model.
Finally, we use PyTorch to transform the input sequence into a tensor format,
as required by the BERT model. This format enables efficient processing and
utilization of the contextual information captured by BERT for downstream
tasks in our natural language processing pipeline.
To generate the relevant embeddings, we leverage Python and TensorFlow
Hub to access ELMo (Embeddings from Language Model). Unlike traditional
word embeddings that represent each word as a fixed vector, ELMo provides a
dynamic and context-dependent representation by considering the surround-
ing context. ELMo is based on a deep bidirectional language model trained on
a large corpus of text. ELMo learns to predict the next word in a phrase given
the previous words, capturing rich contextual information in the process. This
dynamic nature of ELMo enables it to capture nuances and variations in word
meanings based on their specific contexts. By utilizing ELMo in our natural
language processing pipeline, we can incorporate the contextual dependen-
cies of words and enhance the understanding of the semantic content. This
allows us to derive more nuanced and accurate representations of the words,
improving the performance of our language models and text analysis tasks.
Once we have obtained the input tensor, we extract the features using the pre-
trained BERT model. Due to memory limitations, we divide the data into batches
and apply the BERT model to each batch. For each input tensor, we extract the
CLS token from the previous hidden state and associate it with its corresponding
label. To apply the BERT model to the input tensor, we pass the tensor through
the encoder layers of the model and utilize the output for our specific task. In our
case, we aim to generate sentence embeddings represented by the CLS token, which
captures the overall meaning of the input sentence. These embeddings will be used
for subsequent classification tasks. Once all the inputs have been processed by the
BERT model, we save the corresponding labels and CLS tokens in a CSV format.
This allows us to organize and utilize the extracted features for further analysis and
classification purposes.
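A condensed sketch of this extraction step with the Hugging Face transformers
library is given below; the model name, batch size, and sequence length are
illustrative choices, not the exact configuration used here.

```python
# Sketch of CLS-token feature extraction with Hugging Face transformers.
import torch
from transformers import BertModel, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval().to(device)

def cls_embeddings(descriptions, batch_size=32):
    chunks = []
    for i in range(0, len(descriptions), batch_size):
        # Tokenization adds [CLS]/[SEP]/[PAD] tokens and the attention mask.
        batch = tokenizer(descriptions[i:i + batch_size], padding=True,
                          truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch.to(device)).last_hidden_state
        chunks.append(hidden[:, 0, :].cpu())   # position 0 holds the [CLS] token
    return torch.cat(chunks)
```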

3.3 Feature Selection


Feature selection is a crucial step in the machine learning process, aimed at identify-
ing the most relevant features from a larger set of potential features in a dataset. The
objective is to identify the features that have the strongest influence on the target
variable while minimizing the impact of irrelevant or redundant features that may
introduce noise or overfitting. In our case, we initially extracted 1000 features using
the Bag-of-Words (BoW) approach. However, it is important to determine which
of these features are truly informative for our task of detecting botnet malware. To
assess the importance of features, we employed the following methods.

3.3.1 Permutation Feature Importance


To minimize the number of features and identify the most essential ones, we
employed the permutation feature importance method in conjunction with a
Random Forest classifier. This approach is based on the principle that if a feature
is crucial for a model, permuting its values should result in a significant decrease
in the model's performance. Permutation feature importance can be applied with
any machine learning algorithm, but we specifically chose the Random Forest
classifier due to its reputation as a robust model that is less prone to overfitting.
By evaluating the impact of permuting the values of each feature on the model's
performance, we can assess the importance of individual features and identify the
ones that have the greatest influence on our task. The importance i_j of feature
f_j is defined as:

i_j = s - \frac{1}{K} \sum_{k=1}^{K} s_{k,j}

where s is the baseline score (accuracy), K is the number of repetitions, and s_{k,j}
is the score obtained after permuting feature column j in repetition k. After
executing the permutation feature importance algorithm with the Random Forest
classifier, we eliminated any features whose importance
was zero or negative. A negative importance value suggests that altering the values
of that feature actually improved the model’s accuracy, indicating that the feature is
confounding the model and should be discarded. A zero importance value indicates
that the feature has no impact on the outcome.
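This filtering step maps directly onto scikit-learn's permutation_importance, where
K corresponds to n_repeats. The sketch below uses synthetic data as a stand-in
for the BoW matrix.

```python
# Sketch of dropping zero/negative-importance features via permutation
# importance; synthetic data stands in for the 1000-feature BoW matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50, random_state=69)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=69)

rf = RandomForestClassifier(random_state=69).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=10,
                                scoring="accuracy", random_state=69)

keep = result.importances_mean > 0          # discard zero/negative importance
X_tr_sel, X_te_sel = X_tr[:, keep], X_te[:, keep]
print(f"kept {keep.sum()} of {X.shape[1]} features")
```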
It is worth noting that for the BERT model, feature selection was not required as
we would be utilizing the CLS token as a complete feature. The CLS token captures
the overall representation of the input sentence, making individual feature selection
unnecessary in this case.

3.4 Machine Learning


After performing comprehensive feature engineering, machine learning algorithms
are applied to the datasets generated from sandboxing reports. The target variable,
which determines whether a sample is malicious or not, is labeled, and the models’
ability to predict is evaluated.
The dataset consists of 7,679 samples, including both benign and malicious sam-
ples. To train the models, the first step is to split the data into two parts: a training
set and a testing set. The training set will be used for hyperparameter tuning and
model training, while the testing set will be reserved solely for evaluating the final
model’s performance.
The training set is used for model training and hyperparameter tuning and is
essential for optimizing the model's performance. The testing set is kept separate
and is used to evaluate the model after it has been trained, serving as an unbiased
assessment of its ability to generalize to new, unseen data. Splitting the dataset in
this way ensures that the models are trained on a representative sample of the data
and evaluated on independent data. Details of the training and testing samples
(an approximately 75%/25% split) are as under:

Table 4 Training and testing split samples for ML models

Ser. | Type of samples | Malicious | Benign | Total
a. | train set | 3,740 | 2,019 | 5,759
b. | test set | 1,287 | 633 | 1,920
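For reference, the counts in Table 4 correspond to a stratified hold-out split of
roughly 75%/25% (the ratio is inferred from the table), which could be produced
as follows, where X and y denote the feature matrix and labels of the generated
dataset:

```python
# Stratified 75/25 hold-out split; the ratio is inferred from Table 4.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```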

During the hyperparameter tuning process, different algorithms and their respec-
tive hyperparameters were evaluated to identify the best-performing models. For
the MLP classifier, GridSearchCV was utilized with a manual selection of param-
eter values. Six distinct choices were made for the hidden layer size, three for the
maximum number of iterations, two for activation functions, solvers, alpha values,
and learning rate settings, resulting in a total of 36 combinations. This was further
multiplied by 10 for 10-fold cross-validation, resulting in 360 model trainings to find
the optimal parameter combination.
For the remaining models, BayesSearchCV was employed. Unlike GridSearchCV,
which exhaustively explores all parameter combinations, BayesSearchCV uses a
probabilistic function to guide the search and prioritize promising parameter sets.
By considering prior assessments, it focuses on areas of the parameter space likely
to yield higher accuracy. This approach reduces the number of iterations required
to find the optimal hyperparameters and avoids exploring irrelevant regions of the
parameter space [49].
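An illustrative BayesSearchCV configuration with scikit-optimize is sketched below;
the search space and iteration budget are examples, not the exact values used in
our experiments, and X_train/y_train denote the training split.

```python
# Sketch of Bayesian hyperparameter search with scikit-optimize's BayesSearchCV.
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier

search = BayesSearchCV(
    XGBClassifier(objective="binary:logistic"),
    {
        "n_estimators": Integer(100, 2000),
        "max_depth": Integer(2, 10),
        "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
        "subsample": Real(0.5, 1.0),
    },
    n_iter=50, cv=10, scoring="accuracy", random_state=42)

search.fit(X_train, y_train)     # priors steer later iterations toward promising regions
print(search.best_params_)
```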
Cross-validation was employed to assess model performance and detect overfitting.
If there is a significant difference between the validation accuracy and the test
accuracy, it indicates overfitting. However, in this study, no substantial differences
were observed, suggesting that the models were able to generalize well.
It is worth noting that hyperparameter tuning and cross-validation can be com-
putationally intensive and time-consuming processes, as they involve training and
evaluating multiple models with different parameter settings. Therefore, the choice
of search algorithm and the number of parameter combinations to explore should
be made based on available computational resources and time constraints.
The overall goal of the hyperparameter tuning and cross-validation was to ensure
that the selected models were optimized for performance, generalization, and avoid-
ance of overfitting. The classifiers we employed and the accompanying parameters
we found to be optimal are listed below:
• KNeighborsClassifier(leaf_size=1, metric='manhattan', n_neighbors=3, weights='distance')
• LogisticRegression(C=2.860760167238615, max_iter=4897, tol=1.2441575953365077e-05, solver='lbfgs')
• DecisionTreeClassifier(max_depth=100, max_features='auto', min_samples_leaf=2, min_samples_split=2)
• RandomForestClassifier(bootstrap=False, max_depth=44, min_samples_leaf=2, max_features='auto', min_samples_split=5, n_estimators=1800, random_state=69)
• XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.0846542798609431, max_delta_step=0, max_depth=3, min_child_weight=1, missing=1, n_estimators=1610, n_jobs=-1, nthread=None, objective='binary:logistic', random_state=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=69, silent=None, subsample=0.6672598578681597)
• MLPClassifier(activation='tanh', alpha=0.0001, hidden_layer_sizes=(200,), learning_rate='adaptive', max_iter=100, solver='adam')

3.5 Classification techniques


The selection of an appropriate classifier is crucial in botnet detection, especially in
the context of text classification. Researchers face challenges in establishing effective
methods, techniques, and structures for this task [50]. In this study, we explored
various machine learning algorithms to evaluate their performance on the original
reports. For two-class classification, we employed Support Vector Machines (SVM)
[51]. For One-class Classification, we utilized the One-class Support Vector Machine
(OCSVM) [52]. To tackle the problem of Positive and Unlabeled (PU) Learning,
we employed the PU bagging technique [53] with random forest (RF) [54]. For deep
learning approaches, we implemented Recurrent Neural Networks (RNN) [55] and
neural networks using fastText [56], as well as Transformers [57] and RoBERTa
[58]. Convolutional Neural Networks (CNN) [59] [60] [61] were also utilized in the
study. These models were created using Scikit-learn and additional libraries such as
Simple Transformers, TensorFlow Hub, and fastText.
By employing these diverse machine learning algorithms, we aimed to assess their
effectiveness in botnet detection based on the content of the reports.
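Since PU learning has not been applied to botnet detection before, a minimal
sketch of the PU bagging scheme with a Random Forest base learner is given below;
the function and variable names are illustrative, not the exact implementation
used in our experiments.

```python
# Sketch of PU bagging: repeatedly treat a bootstrap of the unlabeled pool as
# negatives, train a base classifier, and average scores on out-of-bag points.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pu_bagging(X_pos, X_unl, n_rounds=100, random_state=0):
    rng = np.random.default_rng(random_state)
    scores = np.zeros(len(X_unl))
    counts = np.zeros(len(X_unl))
    for _ in range(n_rounds):
        idx = rng.choice(len(X_unl), size=len(X_pos), replace=True)
        X = np.vstack([X_pos, X_unl[idx]])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(idx))])
        clf = RandomForestClassifier().fit(X, y)
        oob = np.setdiff1d(np.arange(len(X_unl)), idx)   # out-of-bag unlabeled
        scores[oob] += clf.predict_proba(X_unl[oob])[:, 1]
        counts[oob] += 1
    return scores / np.maximum(counts, 1)                # averaged positive score
```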

4 Results and Evaluations


This section discusses the findings of the machine learning algorithms.
4.1 The outcomes of the BoW method


In our initial experiment, our objective was to train different machine learning mod-
els and identify the best ones capable of accurately labeling botnet samples and
distinguishing between malicious and benign files. We utilized a dataset created
using the Bag of Words (BoW) approach and trained six distinct machine-learning
algorithms on it. To assess the performance of these models, we employed a tenfold
cross-validation technique and performed Grid/Bayesian search to explore various
parameter combinations. Among the six models, XGboost emerged as the top per-
former, achieving an accuracy of 99.17% on our dataset, as well as high precision,
recall, and F1 score, with a value of 99.06%. However, due to the imbalanced nature
of our dataset, accuracy alone may not be the most reliable metric. Nonetheless, the
ROC/AUC analysis yielded an exceptional score of 0.9994, indicating near-perfect
performance, while the Matthews Correlation Coefficient (MCC) score was 0.9811.
RandomForest exhibited similar performance to XGboost, with a slight decrease
across all evaluation criteria.
Table 5 presents a comprehensive overview of the performance metrics for our
models. Overall, most of our models demonstrated favorable results, with the Logis-
tic Regression model exhibiting the lowest performance. The success of our models
can be attributed to the quality of the dataset and the effectiveness of the features
extracted using the BoW technique. Our findings suggest that the features derived
from the BoW approach are capable of detecting malicious patterns effectively.
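All metrics in Table 5 can be computed with scikit-learn, as in the sketch below,
assuming a fitted classifier `model` and the held-out test split from Table 4:

```python
# Sketch: compute the Table 5 metrics for one fitted binary classifier.
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the malicious class

metrics = {
    "Accuracy":  accuracy_score(y_test, y_pred),
    "Precision": precision_score(y_test, y_pred),
    "Recall":    recall_score(y_test, y_pred),
    "F1-score":  f1_score(y_test, y_pred),
    "MCC":       matthews_corrcoef(y_test, y_pred),
    "ROC/AUC":   roc_auc_score(y_test, y_prob),
}
```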

Table 5 Metric results of models trained using the BoW technique

Conventional two-class classifiers (Supervised)
Classifier | Accuracy | Precision | Recall | F1-score | MCC | ROC/AUC
XGBoost | 0.9917 | 0.9906 | 0.9906 | 0.9906 | 0.9811 | 0.9995
RandomForest | 0.9839 | 0.9819 | 0.9815 | 0.9817 | 0.9635 | 0.9985
KNN | 0.9458 | 0.9406 | 0.9363 | 0.9384 | 0.8769 | 0.9606
DecisionTree | 0.9375 | 0.9346 | 0.9229 | 0.9284 | 0.8573 | 0.9482
MLPclassifier | 0.9333 | 0.9287 | 0.9194 | 0.9238 | 0.8479 | 0.9765
LR | 0.8964 | 0.8798 | 0.8894 | 0.8842 | 0.7691 | 0.9391

One-class classification (Unsupervised)
OneClassSVM | 0.7083 | 0.73 | 0.71 | 0.64 | - | -

PU learning (Semi-supervised)
PUBagging | 0.968 | 0.9689 | 0.968 | 0.9679 | - | 0.9975

Deep learning
LSTM | 0.6439 | - | - | - | - | -
CNN | 0.667 | - | - | - | - | -

The number of true positives, true negatives, false positives, and false negatives
is summarised in the confusion matrices in Figure 3. Figure 4 shows the ROC
curves; RF and XGBoost trace an almost perfect ROC curve, indicating that these
models perform quite well.
The DET (detection error tradeoff) curves for our models are shown in Figure 5.
In our specific scenario, our preference is to have a higher number of False Negatives
and a lower number of False Positives. This preference stems from our concern of
misclassifying a malicious sample as benign. By prioritizing this aspect, we can
use the threshold value to evaluate and optimize the model’s performance. Upon
examining the curves, we observe a relatively uniform distribution, indicating the
absence of any significant bias in the model’s predictions. This suggests that the
model is not favoring any specific class and is making balanced predictions across
both benign and malicious samples.

Figure 3 Confusion matrices for models using the BoW approach

Figure 4 ROC curve using the BoW approach
Calibration curves in Figure 6 demonstrate the degree of alignment between the
model’s estimated probabilities and the actual probabilities of different events. A
well-calibrated model should have a curve that closely resembles the diagonal line,
indicating a match between expected and observed probabilities. Deviation from
the diagonal line suggests poor calibration. Despite our models exhibiting high ac-
curacy, precision, and recall, the predicted probabilities do not align well with the
actual probabilities. This discrepancy may be attributed to issues such as overfit-
ting, class bias, or suboptimal decision criteria [62]. Notably, the RandomForest
model appears to exhibit closer alignment with the well-calibrated line, implying a
higher level of reliability for this model.
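The calibration curves can be reproduced with scikit-learn's calibration_curve, as
sketched below for one model; y_test and y_prob denote the test labels and the
predicted probabilities of the malicious class.

```python
# Sketch: compare predicted probabilities against observed frequencies.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
```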

It is fascinating to observe how different models interpret features. Figures 7 and
8 provide insights into the top 20 important features identified by RandomForest
and XGBoost, the best-performing models in our study. At first glance,
these features may seem insignificant, but machine learning algorithms can uncover
patterns within them that contribute to accurate predictions. Notably, XGboost
assigns higher importance to the top feature compared to the second one, and the
second feature is prioritized over the third. However, the importance values of the
remaining features appear to level out. In contrast, the importance values of features
in the RandomForest algorithm exhibit a more gradual decline from the first to the
20th feature. This divergence in feature treatment may account for the perceived
better tuning of RandomForest compared to XGboost.
Figure 5 DET curve using the BoW approach

4.2 The outcomes of the BERT method


In the second part of our study, we utilized the BERT model to extract contextual
word embeddings and used the resulting vectors as features to capture the meaning
of the ”description” field in the ”signatures” section. Following the feature extrac-
tion, we applied the same set of algorithms to evaluate the dataset and compared
the results. Unfortunately, the performance of the models was not as good as in the
previous technique. Among the models, Random Forest achieved the highest accu-
racy of 85.6%, with a precision of 83.9%, a recall of 84.2%, and an F1 score of 84%.
On the other hand, Logistic Regression exhibited the lowest performance, with an
overall score of approximately 75%.
The confusion matrices in Figure 9 provide insights into the performance of the
models. It is evident that the rate of True Negatives (malicious samples correctly
classified as malicious) is relatively high. This can be attributed to the fact that
the dataset consists of a larger number of reports from malicious samples compared
to benign files. The models seem to be more effective in identifying and classifying
malicious samples accurately.
The evaluation of the models using the ROC curves in Figure 10 indicates that
the algorithms are not performing optimally. This is evident from the fact that the
curves are far from the upper left corner of the graph. A well-performing model
would have a curve that closely hugs the upper left corner, indicating high true
positive rates and low false positive rates. Comparing these results with Figure 4,
which shows the ROC curves of the models using the Bag of Words (BoW)
technique, there is a noticeable difference in performance: the models trained with
the BoW technique clearly perform better than the models in the current evaluation.

Figure 6 Calibration plot

Table 6 Metric results of models trained using the BERT technique

Conventional two-class classifiers (Supervised)
Classifier | Accuracy | Precision | Recall | F1-score | MCC | ROC/AUC
RandomForest | 0.8563 | 0.8389 | 0.8420 | 0.8404 | 0.6808 | 0.9237
DecisionTree | 0.8391 | 0.8202 | 0.8218 | 0.8209 | 0.6419 | 0.8708
XGBoost | 0.8391 | 0.8230 | 0.8138 | 0.8180 | 0.6367 | 0.9146
KNN | 0.8164 | 0.7952 | 0.7948 | 0.7950 | 0.5900 | 0.8771
LR | 0.7832 | 0.7625 | 0.7384 | 0.7473 | 0.5003 | 0.8248
MLPclassifier | 0.7943 | 0.7738 | 0.7559 | 0.7631 | 0.5293 | 0.8544

One-class classification (Unsupervised)
OneClassSVM | 0.6072 | 0.57 | 0.61 | 0.58 | - | -

PU learning (Semi-supervised)
PUBagging | 0.8361 | 0.8361 | 0.8361 | 0.8361 | - | 0.9093

Deep learning
LSTM | 0.6507 | - | - | - | - | -
CNN | 0.6595 | - | - | - | - | -

The DET curve in Figure 11 illustrates that even with adjustments to the threshold
chosen by a particular model, the outcomes remain unfavorable. This is evident
when examining the RF graph, where reducing the false negative rate to 5% results
in a false positive rate of approximately 35%. In both cases, the results fall below
the desired level of performance.
Figure 7 Top 20 feature importances for RF

4.3 The outcome of Word2Vec as Feature model


A powerful neural network model called Word2Vec was developed by Tomas Mikolov
and his team at Google specifically for natural language processing tasks [63].
It incorporates two learning methods, namely Skip-gram and Continuous Bag of
Words (CBOW) [64]. Word2Vec takes text input and generates word embeddings
that capture semantic relationships among words. In our study, we utilized the
static approach of Word2Vec, which employs pre-trained word embeddings with
300-dimensional vectors; in the dynamic approach, by contrast, the vectors are
fine-tuned during the training phase. Among the models tested, Random Forest
with Word2Vec achieved the best performance, with an accuracy of 85.55% and an
F1-score of 85.51%, as shown in Table 7.
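A sketch of the static featurization follows: each report is mapped to the mean of
its tokens' 300-dimensional pre-trained vectors. The model file name is illustrative;
any word2vec-format embedding file loadable by gensim would do.

```python
# Sketch: average pre-trained Word2Vec vectors to get one vector per report.
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def doc_vector(tokens):
    # Skip out-of-vocabulary tokens; fall back to a zero vector if none match.
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```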

Table 7 Performance analysis of machine learning algorithms using Word2Vec as feature model

Conventional two-class classifiers (Supervised)
Classifier | Accuracy | Precision | Recall | F1-score
KNeighborsClassifier | 0.8486 | 0.8474 | 0.8486 | 0.8478
LogisticRegression | 0.7823 | 0.7765 | 0.7823 | 0.7763
DecisionTreeClassifier | 0.8347 | 0.8341 | 0.8347 | 0.8344
RandomForestClassifier | 0.8555 | 0.8548 | 0.8555 | 0.8551
SVM | 0.8132 | 0.8108 | 0.8132 | 0.8116
XGBClassifier | 0.8548 | 0.8540 | 0.8549 | 0.8544
MLPClassifier | 0.8013 | 0.8231 | 0.8013 | 0.8058

One-class classification (Unsupervised)
OneClassSVM | 0.6381 | 0.5925 | 0.6381 | 0.6043

PU learning (Semi-supervised)
PUBagging | 0.8543 | 0.912 | 0.8646 | -

4.4 The outcome of GloVe as Feature model


We utilized a word representation model called GloVe (Global Vectors) [40], which
takes into account global corpus data to generate real-valued vectors representing
each word in a semantic vector space model. Among the models evaluated, the XG-
BClassifier with GloVe achieved the best performance, with an accuracy of 85.43%
and an F1-score of 85.41% as shown in Table 8.
Figure 8 Top 20 feature importances for XGboost

Table 8 Performance analysis of machine learning algorithms using GloVe as feature model

Conventional two-class classifiers (Supervised)
Classifier | Accuracy | Precision | Recall | F1-score
KNeighborsClassifier | 0.8454 | 0.8440 | 0.8454 | 0.8445
LogisticRegression | 0.7905 | 0.7853 | 0.7905 | 0.7842
DecisionTreeClassifier | 0.8347 | 0.8349 | 0.8347 | 0.8348
RandomForestClassifier | 0.8524 | 0.8517 | 0.8524 | 0.8520
SVM | 0.8088 | 0.8052 | 0.8088 | 0.8057
XGBClassifier | 0.8543 | 0.8539 | 0.8543 | 0.8541
MLPClassifier | 0.8145 | 0.8108 | 0.8145 | 0.8108

One-class classification (Unsupervised)
OneClassSVM | 0.6366 | 0.5970 | 0.6366 | 0.6083

PU learning (Semi-supervised)
PUBagging | 0.8454 | 0.9100 | 0.8522 | -

4.5 Performance Evaluation of State-of-the-Art Neural and Deep Learning Classifiers
Modern text classification techniques heavily rely on neural networks as they have
shown great effectiveness in handling complex language patterns and capturing se-
mantic information. One popular neural network library used for text classification
is FastText, developed by Facebook. FastText is designed for both supervised text
categorization and unsupervised word embeddings. It introduces a lightweight and
efficient approach to training text classifiers by using a hierarchical structure, reduc-
ing the training and testing time complexity from linear to logarithmic with respect
to the number of classes. FastText also leverages the Huffman algorithm to optimize
computational efficiency, particularly when dealing with imbalanced classes, such
as in our Spam dataset, where some classes occur more frequently than others [65]
[56]. Deep learning architectures have emerged as powerful tools in various domains,
surpassing the performance of traditional shallow machine learning methods. This
holds true for natural language processing (NLP), image classification, and other
tasks. Unlike shallow machine learning, deep learning models can automatically
learn meaningful features directly from high-dimensional raw data. As a result, we
considered several state-of-the-art deep learning models that have been successfully
employed by researchers for text classification and spam detection. These models
include recurrent neural networks (RNNs), convolutional neural networks (CNNs),
transformer models, and more. The choice of architecture depends on the specific
learning objectives and the nature of the data being analyzed [66] [55] [59] [67].

Figure 9 Confusion matrices for models using the BERT approach

Figure 10 ROC curve using the BERT approach
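To make the transformer experiments concrete, the following is a minimal
fine-tuning sketch using the Hugging Face transformers library [48]. The dataset
objects train_ds and eval_ds (with text and label columns), the bert-base-uncased
checkpoint, and the hyperparameters are illustrative assumptions rather than our
exact training configuration.

# Hypothetical sketch: fine-tune BERT for binary botnet classification.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # botnet vs. benign

def tokenize(batch):
    # Truncate long sandbox reports to BERT's 512-token input limit
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# train_ds / eval_ds: datasets.Dataset objects with "text" and "label"
# columns (assumed to exist)
args = TrainingArguments(output_dir="bert-botnet",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds.map(tokenize, batched=True),
                  eval_dataset=eval_ds.map(tokenize, batched=True))
trainer.train()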

Table 9 Performance Evaluation of State-of-the-Art Neural and Deep Learning Classifiers
Classifiers Accuracy Precision Recall F1-score
FastText 0.8210 0.8200 0.8209 0.8204
BERT 0.8544 0.8912 0.8904 0.8908
DistilBERT 0.8550 0.8893 0.8927 0.8910
RoBERTa 0.8544 0.8896 0.8921 0.8909
LSTM 0.7580 0.7500 0.7600 0.7500
BiLSTM 0.7150 0.7300 0.7200 0.7200
CNN (Random) 0.7600 0.7600 0.7700 0.7600
CNN (GloVe) 0.7220 0.7700 0.7200 0.7300
ENSEMBLE (Random) 0.8240 0.8200 0.8200 0.8200
ENSEMBLE (Static) 0.7860 0.7800 0.7900 0.7800
ENSEMBLE (Dynamic) 0.7820 0.7800 0.7800 0.7700

5 Discussion and Future Work


5.1 BoW approach
When it comes to feature selection, it is important to consider the overall perfor-
mance of the algorithm. If the algorithm shows poor results in terms of low test
scores or cross-validation scores, certain features may be labeled as unimportant.
However, it is crucial to note that these features might still play a significant role in
practical scenarios. To address this issue, we initially chose a strong model that had
already demonstrated good performance before performing any feature selection.
The results obtained from this approach provide insights into the relevance of
features for that particular model. It is important to keep in mind that the
relevance of features may vary across different models.

Figure 11 DET curve using the BERT approach
Additionally, it is essential to consider the presence of correlations among features.
In cases where features are highly correlated, the importance of individual features
may be underestimated. Even if one of the correlated features is removed or altered,
the other feature may still drive the model to make similar predictions as before.
To address this, a clustering technique could have been employed using Spearman
rank-order correlations. By establishing a threshold and selecting a representative
feature from each cluster, the impact of correlated features could have been better
managed.
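A sketch of that idea follows, assuming X is the dense Bag of Words feature matrix
of shape (n_samples, n_features); the distance threshold t=1.0 is an illustrative
choice, not a tuned value.

# Hypothetical sketch: cluster correlated features via Spearman
# rank-order correlations and keep one representative per cluster.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

corr, _ = spearmanr(X)                        # feature-feature correlations
corr = np.nan_to_num((corr + corr.T) / 2.0)   # enforce symmetry, drop NaNs
np.fill_diagonal(corr, 1.0)

# Convert |correlation| to a condensed distance matrix and cluster
dist = squareform(1.0 - np.abs(corr), checks=False)
linkage = hierarchy.ward(dist)
cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion="distance")

# Keep the first feature of each cluster as its representative
selected = [np.where(cluster_ids == c)[0][0] for c in np.unique(cluster_ids)]
X_reduced = X[:, selected]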
Furthermore, eliminating features that were estimated to have zero relevance may
not always be the best approach. It is possible that the zero relevance estimation
could be an indication of a correlation between those features. Taking this into
consideration, it would have been prudent to investigate and address correlations
among features before discarding them based on relevance estimation alone.
We also conducted an evaluation of the feature importance for the two highest
performing algorithms. During this analysis, we discovered that some of the features
on the list may not seem logical or intuitively important. It would be interesting to
investigate further and understand why these features are considered important by
the algorithms, as there could be instances where they are mistakenly identified as
such.
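For reference, a ranking like the one in Figure 8 can be extracted from a fitted
model along the following lines; the variable model, assumed to be a trained
XGBClassifier, is the only external dependency.

# Hypothetical sketch: top-20 gain-based importances from XGBoost.
import matplotlib.pyplot as plt
import pandas as pd

scores = model.get_booster().get_score(importance_type="gain")
top20 = pd.Series(scores).sort_values(ascending=False).head(20)

top20.plot.barh(figsize=(8, 6))
plt.gca().invert_yaxis()        # highest-gain feature on top
plt.xlabel("Gain")
plt.title("Top 20 feature importances (XGBoost)")
plt.tight_layout()
plt.show()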

For instance, one of the features identified was "mqasim", which corresponds to
the username of the profile on the machine running the sandbox. This feature was
not identified during the cleaning step of our process. It appears frequently in the
reports because the sandbox needs to record the path to where actions or logs
are saved. Therefore, if there are numerous actions taking place, they will appear
frequently in the reports. It is important to note that cleansing the reports could
potentially improve the performance of our algorithms by removing such irrelevant
or misleading features.
Furthermore, we compared our approach to techniques described in the literature,
and the results were found to be consistent and closely aligned with those reported
in previous studies. However, it is important to note that accuracy alone should not
be considered as the sole criterion for evaluation, as it does not provide a complete
picture of the model’s performance.
Table 10 Comparative analysis of the proposed model with published work.
Research | Target | Category | Limitations | Highest recorded accuracy (%)
Dbot | DGA botnets | networks | specific botnet | 99.6
Alauthaman et al. | P2P botnets | networks | P2P botnets, outdated | 99.2
Aswathi et al. | 7 botnet families | networks | network patterns, outdated | 99.6
Our project | real-world botnets | client | sandbox evasion | 99.7

5.2 BERT approach


As mentioned earlier, we initially eliminated all duplicated files, assuming that
our dataset consisted of unique reports. However, during the experiment, we en-
countered unsatisfactory results, which prompted us to investigate further. It was
discovered that although the reports themselves were unique, the data within the
reports was not distinct.
Specifically, we found that 6490 reports were duplicated due to limitations in the
sandbox's ability to provide unique "descriptions". These duplicated reports had
identical content. Additionally, there were cases where reports shared the same
set of features but carried conflicting labels. This inconsistency
had a significant impact on the model’s performance. In some instances, samples
with the same feature and label in both the test and training sets were easily
classified by the model. However, there were cases where a sample in the test set
had the same feature observed in the training set, but with contradictory labels,
leading to incorrect classification. There was also a possibility of two samples with
opposite labels being present in the training set, which could mislead the model.
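A small consistency check of this kind can be expressed as follows; the file name
report_features.csv and the column names report_id, text, and label are
assumptions about how the extracted data is stored.

# Hypothetical sketch: flag duplicated feature texts and label conflicts.
import pandas as pd

df = pd.read_csv("report_features.csv")  # columns: report_id, text, label

# Reports whose feature text also occurs in another report
dupes = df[df.duplicated(subset="text", keep=False)]
print(f"{dupes.shape[0]} reports share their feature text with another report")

# Feature texts that appear with more than one distinct label
conflicts = (dupes.groupby("text")["label"]
                  .nunique()
                  .loc[lambda s: s > 1])
print(f"{conflicts.shape[0]} feature texts carry contradictory labels")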
Removing these 6490 duplicated reports from the data would result in an insuf-
ficient amount of data to effectively train and test a machine learning model. This
finding suggests that the data we selected as our feature vector was insufficient, and
we may need to explore other data sources within the reports. However, it is im-
portant to note that we have thoroughly investigated the available data and believe
that there is no additional useful information that can be utilized.

5.3 Comparison of all techniques


The comparative analysis shows that the Bag of Words technique produces the best
overall results, as shown in Figure 13 below.

Figure 12 Samples with contradicting labels

Figure 13 Graph showing the comparison of all techniques



5.4 General Comments


The project is centered around analyzing reports generated by executing samples in
a sandbox environment. However, it is important to acknowledge that malware cre-
ators are aware of such analysis methods and have developed techniques to detect
and evade these environments. They can determine whether the execution environment
is virtualized, instrumented (such as a sandbox), or being debugged, and may alter
their behavior accordingly. In some cases, the malware may exhibit different behav-
ior or simply terminate its processes without executing any malicious actions. These
evasion techniques often involve examining registry entries or monitoring running
processes.
To make this project more resilient and effective, it is crucial to consider sandbox
hardening techniques [68]. Sandbox hardening involves implementing countermea-
sures to prevent or deter malware from detecting and evading the sandbox environ-
ment. By enhancing the sandbox’s security and preventing detection, the accuracy
and effectiveness of the botnet detection approach can be improved.
While the focus of this work is on botnet detection, a similar approach can po-
tentially be applied to detect other types of malware. The ultimate objective is to
enhance cybersecurity and make the digital space more secure against various forms
of malicious activities. Exploring the application of this methodology to different
types of malware can contribute to achieving this goal.

6 Conclusion
In conclusion, our investigation into botnet malware detection using machine learn-
ing has opened exciting possibilities for enhanced cybersecurity. We delved into
the realm of feature extraction, specifically from sandbox-generated reports, and
devised two innovative approaches, both of which delivered strong results.
By harnessing the power of a Bag of Words representation and employing smart
feature selection, we crafted a robust dataset that captured the essence of malicious
behavior. Our machine learning algorithms eagerly dived into this treasure trove
of information, learning patterns and trends that allowed them to predict with
astounding accuracy whether a file is harboring dangerous botnet malware. The
XGBoost classifier emerged as the shining star of our study, boasting an impressive
accuracy of 99.17% and an awe-inspiring ROC/AUC score of 0.9995. Not far behind,
the RandomForest classifier held its ground with an impressive accuracy of 98.38%
and a commendable ROC/AUC score of 0.9985. These results showcase the power
of machine learning in the realm of cybersecurity. Of course, no endeavor comes
without challenges and room for improvement. We carefully examined limitations
and drawbacks, paving the way for future enhancements. Sandbox hardening and
expanding our scope to tackle other forms of malware are exciting avenues that
await exploration.
In a world where cyber threats loom large, our research ignites hope. By pushing
the boundaries of feature extraction and harnessing the might of machine learn-
ing, we take a significant step toward securing the digital landscape. Let us march
forward, armed with knowledge and innovation, in the quest for a safer cyberspace.

Appendix
Text for this section. . .

Acknowledgements
Text for this section. . .

Funding
Text for this section. . .

Abbreviations
Text for this section. . .

Availability of data and materials


Text for this section. . .

Ethics approval and consent to participate


Text for this section. . .

Competing interests
The authors declare that they have no competing interests.

Consent for publication


Text for this section. . .

Authors’ contributions
Text for this section . . .

Authors’ information
Text for this section. . .

Author details
1 Department of Electrical Engineering, NUST, Islamabad, Pakistan. 2 Institute of Computer Science, Denmark
Technical University, DTU, Kiel, Germany.

References
1. Baraz, A., Montasari, R.: Law enforcement and the policing of cyberspace. In: Digital Transformation in
Policing: The Promise, Perils and Solutions, pp. 59–83. Springer, ??? (2023)
2. Rane, S., Devi, G., Wagh, S.: Cyber threats: Fears for industry. In: Cyber Security Threats and Challenges
Facing Human Life, pp. 43–54. Chapman and Hall/CRC, ??? (2023)
3. Kaur, J., Ramkumar, K.: The recent trends in cyber security: A review. Journal of King Saud
University-Computer and Information Sciences 34(8), 5766–5781 (2022)
4. Thanh Vu, S.N., Stege, M., El-Habr, P.I., Bang, J., Dragoni, N.: A survey on botnets: Incentives, evolution,
detection and current trends. Future Internet 13(8), 198 (2021)
5. Negash, N., Che, X.: An overview of modern botnets. Information Security Journal: A Global Perspective
24(4-6), 127–132 (2015)
6. Falliere, N., Chien, E.: Zeus: King of the bots. Symantec Security Response (http://bit.ly/3VyFV1) (2009)
7. Prasad, R., Rohokale, V.: BOTNET, pp. 43–65. Springer, Cham (2020). doi:10.1007/978-3-030-31703-4_4
8. Antonakakis, M., April, T., Bailey, M., Bernhard, M., Bursztein, E., Cochran, J., Durumeric, Z., Halderman,
J.A., Invernizzi, L., Kallitsis, M., et al.: Understanding the mirai botnet. In: 26th USENIX Security Symposium
(USENIX Security 17), pp. 1093–1110 (2017)
9. Gaonkar, S., Dessai, N.F., Costa, J., Borkar, A., Aswale, S., Shetgaonkar, P.: A survey on botnet detection
techniques. In: 2020 International Conference on Emerging Trends in Information Technology and Engineering
(ic-ETITE), pp. 1–6 (2020). IEEE
10. Khattak, S., Ramay, N.R., Khan, K.R., Syed, A.A., Khayam, S.A.: A taxonomy of botnet behavior, detection,
and defense. IEEE communications surveys & tutorials 16(2), 898–924 (2013)
11. Owen, H., Zarrin, J., Pour, S.M.: A survey on botnets, issues, threats, methods, detection and prevention.
Journal of Cybersecurity and Privacy 2(1), 74–88 (2022)
12. Anderson, H.S., Roth, P.: Ember: an open dataset for training static pe malware machine learning models.
arXiv preprint arXiv:1804.04637 (2018)
13. Garcia, S., Grill, M., Stiborek, J., Zunino, A.: An empirical comparison of botnet detection methods. computers
& security 45, 100–123 (2014)
14. Ugarte-Pedrero, X., Graziano, M., Balzarotti, D.: A close look at a daily dataset of malware samples. ACM
Transactions on Privacy and Security (TOPS) 22(1), 1–30 (2019)
15. Wang, T.-S., Lin, H.-T., Cheng, W.-T., Chen, C.-Y.: Dbod: Clustering and detecting dga-based botnets using
dns traffic analysis. Computers & Security 64, 1–15 (2017)
16. Hwang, R.-H., Peng, M.-C., Nguyen, V.-L., Chang, Y.-L.: An lstm-based deep learning approach for classifying
malicious traffic at the packet level. Applied Sciences 9(16), 3414 (2019)
17. Cui, J., Long, J., Min, E., Mao, Y.: Wedl-nids: improving network intrusion detection using word
embedding-based deep learning method. In: International Conference on Modeling Decisions for Artificial
Intelligence, pp. 283–295 (2018). Springer
18. Oak, R., Du, M., Yan, D., Takawale, H., Amit, I.: Malware detection on highly imbalanced data through
sequence modeling. In: Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, pp.
37–48 (2019)
19. Alauthaman, M., Aslam, N., Zhang, L., Alasem, R., Hossain, M.A.: A p2p botnet detection scheme based on
decision tree and adaptive multilayer neural networks. Neural Computing and Applications 29(11), 991–1004
(2018). doi:10.1007/s00521-016-2564-5
20. Faculty of Engineering. https://www.uvic.ca/ecs/ece/isot/assets/docs/isot-datase.pdf

21. Shiravi, A., Shiravi, H., Tavallaee, M., Ghorbani, A.A.: Toward developing a systematic approach to generate
benchmark datasets for intrusion detection. Computers and Security 31(3), 357–374 (2012).
doi:10.1016/j.cose.2011.12.012
22. Feizollah, A., Anuar, N.B., Salleh, R., Amalina, F., Shamshirband, S., et al.: A study of machine learning
classifiers for anomaly-based mobile botnet detection. Malaysian Journal of Computer Science 26(4), 251–265
(2013)
23. Zago, M., Pérez, M.G., Pérez, G.M.: Umudga: A dataset for profiling dga-based botnet. Computers & Security
92, 101719 (2020)
24. Wojnowicz, M., Chisholm, G., Wolff, M., Zhao, X.: Wavelet decomposition of software entropy reveals
symptoms of malicious code. Journal of Innovation in Digital Ecosystems 3(2), 130–140 (2016)
25. Shiravi, A., Shiravi, H., Tavallaee, M., Ghorbani, A.A.: Toward developing a systematic approach to generate
benchmark datasets for intrusion detection. computers & security 31(3), 357–374 (2012)
26. Alauthaman, M., Aslam, N., Zhang, L., Alasem, R., Hossain, M.A.: A p2p botnet detection scheme based on
decision tree and adaptive multilayer neural networks. Neural Computing and Applications 29(11), 991–1004
(2018)
27. Botnet and Ransomware Detection Datasets, University of Victoria. https://www.uvic.ca/ecs/ece/isot/datasets/botnet-ransomware/index.php/. Accessed 2023-01-23
28. Zhou, Y., Jiang, X.: Dissecting android malware: Characterization and evolution. In: 2012 IEEE Symposium on
Security and Privacy, pp. 95–109 (2012). IEEE
29. Cylance repositories. https://github.com/orgs/cylance/repositories?type=all. Accessed 2023-01-24
30. VirusTotal: Online malware analysis platform. http://www.virustotal.com/. Accessed 2022-11-29
31. abuse.ch: Fighting malware and botnets. https://abuse.ch/. Accessed 2022-11-29
32. ANY.RUN: Interactive online malware sandbox. https://any.run/. Accessed 2022-11-29
33. Sebastián, M., Rivera, R., Kotzias, P., Caballero, J.: Avclass: A tool for massive malware labeling. In: Research
in Attacks, Intrusions, and Defenses: 19th International Symposium, RAID 2016, Paris, France, September
19-21, 2016, Proceedings 19, pp. 230–253 (2016). Springer
34. Cuckoo Sandbox: Automated malware analysis. https://cuckoosandbox.org/. Accessed 2022-11-29
35. json - JSON encoder and decoder. https://docs.python.org/3/library/json.html
36. Soviany, S., Soviany, C.: Feature engineering. Principles of Data Science, 79–103 (2020).
doi:10.1007/978-3-030-43981-1_5
37. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv
preprint arXiv:1301.3781 (2013)
38. Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for machine translation. arXiv
preprint arXiv:1309.4168 (2013)
39. Řehůřek, R., Sojka, P.: Gensim—statistical semantics in Python. Retrieved from gensim.org (2011)
40. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of
the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
41. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information.
Transactions of the association for computational linguistics 5, 135–146 (2017)
42. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized
word representations. arXiv preprint arXiv:1802.05365 (2018)
43. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805 (2018)
44. Papers with Code - BERT: Pre-training of deep bidirectional transformers for language understanding.
https://paperswithcode.com/paper/bert-pre-training-of-deep-bidirectional
45. PyTorch Team: PyTorch: An open source machine learning framework. Facebook Inc. (2021)
46. NVIDIA Corporation: CUDA Toolkit 11.2. https://developer.nvidia.com/cuda-11.2.0-download-archive (2021)
47. NVIDIA Corporation: cuDNN. https://developer.nvidia.com/cudnn (2021)
48. Wolf, T., Debut, L., Sanh, V., Chaplot, D.S., Delangue, J., Moi, A., Cistac, P., Louf, R., Funtowicz, M.,
Gardner, M., et al.: huggingface/transformers. https://github.com/huggingface/transformers
49. BayesSearchCV. https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html
50. Feldman, R., Sanger, J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data.
Cambridge University Press, ??? (2007)
51. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
52. Manevitz, L.M., Yousef, M.: One-class svms for document classification. Journal of machine Learning research
2(Dec), 139–154 (2001)
53. Hu, W., Le, R., Liu, B., Ji, F., Chen, H., Zhao, D., Ma, J., Yan, R.: Learning from positive and unlabeled data
with adversarial training (2020)
54. DeBarr, D., Wechsler, H.: Spam detection using clustering, random forests, and active learning. In: Sixth
Conference on Email and Anti-Spam. Mountain View, California, pp. 1–6 (2009). Citeseer
55. Mandic, D., Chambers, J.: Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and
Stability. Wiley, ??? (2001)
56. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint
arXiv:1607.01759 (2016)
57. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and
lighter. arXiv preprint arXiv:1910.01108 (2019)
58. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.:
Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
59. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: Proceedings
of the AAAI Conference on Artificial Intelligence, vol. 29 (2015)
60. Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for
sequence modeling. arXiv preprint arXiv:1803.01271 (2018)

61. Chen, Y.: Convolutional neural network for sentence classification. Master’s thesis, University of Waterloo
(2015)
62. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. 34th International
Conference on Machine Learning, Icml 2017 3, 2130–2143 (2017)
63. Goldberg, Y., Levy, O.: word2vec explained: deriving mikolov et al.’s negative-sampling word-embedding
method. arXiv preprint arXiv:1402.3722 (2014)
64. Ma, L., Zhang, Y.: Using word2vec to process big text data. In: 2015 IEEE International Conference on Big
Data (Big Data), pp. 2895–2897 (2015). IEEE
65. Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D.: Text classification
algorithms: A survey. Information 10(4), 150 (2019)
66. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
67. Tay, Y., Dehghani, M., Bahri, D., Metzler, D.: Efficient transformers: A survey. ACM Computing Surveys
55(6), 1–28 (2022)
68. Martin, D.: Hardening cuckoo sandbox against VM aware malware.
https://cybersecurity.att.com/blogs/labs-research/hardening-cuckoo-sandbox-against-vm-aware-malware

Additional Files
Additional file 1 — Sample additional file title
Additional file descriptions text (including details of how to view the file, if it is in a non-standard format or the file
extension). This might refer to a multi-page table or a figure.

Additional file 2 — Sample additional file title


Additional file descriptions text.
