2022 Iiccit
Basrah , Iraq
Hasan Abdulkader
Mustafa Al-farttoosi
Dept. of Electrical and Computer
Computer and electrical engineering
Engineering (ECE)
Altinbas University
Altinbas University
Istanbul, Turkey
Istanbul, Turkey
[email protected]
[email protected]
Abstract: In this study, we discuss the most dangerous and most widespread attacks, namely botnets. We also discuss the bot attack mechanism and introduce methods based on machine learning (ML) and deep learning to detect attacks delivered by SMS or malware. This study investigates the application of several ML methods, including logistic regression, random forest, and deep neural networks. Deep learning was applied through artificial neural networks (ANN) in more than one configuration. Datasets were divided into two halves to obtain more accurate results than classical learning. Special preprocessing of the datasets was applied to improve the performance of the classification algorithms. The accuracy obtained from deep learning was encouraging: 99.79% for the SMS attacks and 98.48% for the malware attacks.
Keywords: Botnet, Deep learning, Ham, Machine learning, Malware, Spam.

I. INTRODUCTION
The next mobile generation (5G) will include many innovative technologies, such as device-to-device (D2D) communication, machine-to-machine communication, massive MIMO, autonomous vehicles, software-defined networking (SDN), and network function virtualization. This development will lead to an increase in the number of users, user equipment (UE), and IoT machines. According to the Ericsson Mobility Report [1], 5G is predicted to reach 1.5 billion subscribers for eMBB by the end of 2024. 5G is also expected to connect approximately 7 trillion wireless devices or things and to shrink the average service creation time from 90 hours to 90 minutes [2]. 5G is the first mobile technology designed to meet the requirements of connected devices for health, industrial applications, transportation, banking, and many other IoT use cases. However, this will increase the potential security challenges, safety issues, and cyber attacks, since most user equipment already runs an open system, which makes these devices a tempting target for hackers. This applies especially to IoT devices, which are considered vulnerable to cyber attacks. In general, the main reasons that make IoT devices and UE vulnerable to hacking include [3]:
• Lack of encryption or poor encryption
• Default password
• Poor support
• Lack of user awareness

Among the potential malicious attacks, the most powerful is the botnet attack. Kaspersky Lab reported in 2016 that botnet-assisted DDoS attacks represented 78.9 percent of all detected attacks [4]. In the future, Internet of Things (IoT) devices and UE may play a main role in health, industrial, banking, and other vital fields, so any threat to these devices will be a serious problem for institutions and factories.

In establishments that use these devices, and on mobile devices in particular, an attacker may steal sensitive, confidential, or private information, credentials, and financial information using botnet attacks, which may also aim to create a distributed denial of service (DDoS). Similar to botnets in legacy mobile networks, future botnets will target the 5G network.

TABLE I. TYPES OF A BOTNET ATTACK
Attack | Impact of Attack
DDoS | Preventing a single system from servicing legitimate requests
Adware | Active advertising of a commercial offering without the user's permission or awareness
Spyware | Sending information to the botmaster about a victim's activity, such as credit card numbers, passwords, and other information that can be sold on the black market
Email spam | Flooding people with emails disguised as messages from people but containing malicious links

Networks of many UE (mobile or IoT) come under the control of a malicious actor, called the bot master, who periodically issues orders. In a centralized 5G mobile botnet, the compromised devices are controlled through a central command and control (C&C) server [5]. The bot master is responsible for choosing the mobile devices that will be compromised by malware and turned into bots. A compromised device can in turn act as a bot proxy server, which serves as a means of communication between the bot master and the other slave bots. Also, the bot device will receive requests from the bot master for a specific duration of time. These attacks will form a very large network of malicious devices with an unstable topology, because user equipment moves from one macro cell to another.
Authorized licensed use limited to: ULAKBIM UASL - Altinbas Universitesi. Downloaded on January 14,2023 at 09:28:52 UTC from IEEE Xplore. Restrictions apply.
2022 Iraqi International Conference on Communication & Information Technologies ( IICCIT-2022) , Basrah University ,
Basrah , Iraq
advocated using feature selection and the Neural Network model for SMS spam identification and got an excellent accuracy rate of nearly 98%.

There are a few other studies on malware detection using ML. The study in [16], “A Multi-Dimensional Machine Learning Approach to Predict Sophisticated Malware,” focused on predicting advanced malware comparable to Stuxnet by employing four separate variants of the regression algorithm, among them random forest regression and linear and polynomial regression. According to the findings of that study, linear and polynomial regression were inefficient among the four, whereas random forest regression delivered superior predictions with additional data.

The study in [17], “Machine Learning Aided Android Malware Classification,” investigated finding and classifying malware in mobile apps using ML. It showed that the permission-based technique could distinguish between malware and goodware in 89% of cases, while the classification performance of source code analysis was above 95%: SVM had a 95.1% accuracy rate, while ensemble learning had a 95.6% accuracy rate.

Hemalatha and Selvabrunda [18] suggested ML classifiers to identify mobile malware, using a mixed kernel function unique to support vector machines and selected fundamental information, such as data content, time, and order, obtained through various network-based functions. The MalGenome dataset was used in the evaluation. The implementation was based on SVM with a mixed kernel function, which yielded an accuracy of 96.89%, an improvement over previous models.

Furthermore, the study by Cuan et al. showed how ML approaches can be employed to identify malicious PDF files. First, an SVM classifier was built that was capable of detecting 99.7% of the malware. Although a malicious file could easily fool the classifier, the classifier was subsequently hardened using forged data. The authors also report that they successfully used a gradient-descent attack to evade the SVM algorithm [19].

III. METHODOLOGY
A. Dataset modification
Because our topic is restricted to the mobile botnet, two datasets were chosen. The first was the SMS dataset, which consisted of spam and ham messages. For each message, the dataset provided two fields: a label describing the message and the message text. The dataset contained 5,574 messages in total. The second dataset, MALWARE, contained 138,047 applications described by 54 features, such as “Size of Optional Header, Address of Entry Point, Major Linker Version, Minor Linker Version and Size of Code.” Besides the features, the dataset labeled each application as “legitimate,” which could be True or False. The program was developed on the “Google Colab” platform using Python and specialized libraries, such as scikit-learn and pandas. We used pandas to analyze the data and to manipulate tabular data in DataFrames. Pandas allows importing data from various file formats, such as comma-separated values (CSV), JSON, Parquet, SQL database tables or queries, and Microsoft Excel. Our data were stored as CSV files. In the following, we explain in detail how we used them.

a) SMS: We found that the SMS dataset was not evenly distributed. Unbalanced datasets are common, and the problem arises when ML algorithms try to detect rare cases in large datasets where positive examples are few. Because the classes have different sizes, a classifier tends to assign samples to the class with the most cases, the majority class, while still giving the illusion of a highly accurate model. The minority class, least present in the dataset, does not participate enough in the learning process, so the accuracy reported by the prediction models we built was misleading. The proportion of ham was much higher than that of spam, so a new dataset was generated from the two categories by applying the oversampling method to the under-represented spam class: duplicates of samples from the minority class were added until the data became an over-sampled, balanced dataset.

b) Malware: Initially, we faced difficulty in training because of the small number of legitimate programs, 41,323 samples, compared to the number of malware programs, 96,724 samples. The difference between malware and legitimate apps is 55,401 samples, which is very large. After splitting the dataset into two sub-datasets, we found that the model trained on one half greatly outperformed the model trained on the other. We therefore shuffled the data with the pandas method “df.sample()” and rebuilt the index with “df.reset_index()”, which proved to work effectively. The call was written as:
“malData.sample(frac=1).reset_index(drop=True)”
To make the data more balanced even after splitting it into two halves, and as part of preparing the data for ML, we normalized the malware dataset by applying a Python “lambda” function to each column:
“X_Data1.apply(lambda x: (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0)))”
The objective of normalization was to let the values of the features range between 0 and 1 according to the following equation:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is an original feature value, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that feature.

The idea of dividing the dataset into two halves was proposed, and the algorithms were trained on each half, which later allowed us to find more accurate results. Then, the accuracy measures were averaged to obtain a single value of accuracy.
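The oversampling step described for the SMS dataset can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function name `oversample_minority`, the column names `label` and `message`, and the toy data are hypothetical, and random duplication with `DataFrame.sample(replace=True)` is only one common way to add duplicates of minority-class rows.

```python
import pandas as pd


def oversample_minority(df: pd.DataFrame, label_col: str, seed: int = 0) -> pd.DataFrame:
    """Duplicate minority-class rows at random until every class matches the largest one."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = []
    for label, count in counts.items():
        rows = df[df[label_col] == label]
        if count < target:
            # Sample with replacement to add duplicates of minority rows.
            extra = rows.sample(n=target - count, replace=True, random_state=seed)
            rows = pd.concat([rows, extra])
        parts.append(rows)
    # Shuffle so the duplicated rows are not grouped at the end.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy stand-in for the SMS data (hypothetical): 6 ham messages and 2 spam messages.
sms = pd.DataFrame({
    "label": ["ham"] * 6 + ["spam"] * 2,
    "message": [f"msg {i}" for i in range(8)],
})
balanced = oversample_minority(sms, "label")
print(balanced["label"].value_counts().to_dict())
```

After the call, both classes have the same number of rows, which mirrors the balanced SMS dataset described in the text.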
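The malware preprocessing described above (shuffling, min-max normalization, and division into two halves) can be sketched as below. The helper names, the toy feature values, and the guard for constant columns are assumptions added for illustration; the source only gives the `sample(frac=1).reset_index(drop=True)` call and the per-column lambda.

```python
import pandas as pd


def min_max_normalize(features: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1]; constant columns become 0 instead of NaN."""
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    # Assumption: replace a zero range by 1 to avoid dividing by zero.
    return (features - col_min) / col_range.replace(0, 1)


def shuffle_and_split_halves(df: pd.DataFrame, seed: int = 0):
    """Shuffle the rows, then cut the frame into two halves of (almost) equal size."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    middle = (len(shuffled) + 1) // 2
    return shuffled.iloc[:middle], shuffled.iloc[middle:].reset_index(drop=True)


# Hypothetical feature values with very different scales.
mal_data = pd.DataFrame({
    "SizeOfCode": [1024, 4096, 65536, 512],
    "MajorLinkerVersion": [6, 9, 14, 6],
    "legitimate": [1, 0, 0, 1],
})
features = min_max_normalize(mal_data.drop(columns="legitimate").astype(float))
first_half, second_half = shuffle_and_split_halves(
    pd.concat([features, mal_data["legitimate"]], axis=1)
)
print(features)
```

Shuffling before the cut is what keeps the two halves comparable in their mix of malware and legitimate samples; without it, the halves inherit whatever ordering the CSV file happens to have.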
The entire work was based on scikit-learn. For SMS, each of the two halves was split into 70% training data (1,950 messages per half, 3,900 in total) and 30% test data (837 messages per half, 1,674 in total). The malware dataset was likewise split into two halves, with 80% training data (55,219 samples for the first half and 55,218 for the second, 110,437 in total) and 20% test data (13,805 samples for each half, 27,610 in total).

After that, we applied these datasets to the three algorithms (Random Forest, Logistic Regression, and Artificial Neural Network) and trained the models with the datasets before and after oversampling.

• Random forest: it selects random samples from a given dataset, constructs a decision tree for each sample, and gets a prediction result from each decision tree. It then performs a vote over the predicted results and selects the prediction with the most votes as the final prediction: ‘Spam or Ham’ in the SMS datasets and ‘Legit or non-Legit apps’ in the malware dataset.
• Logistic regression: a classification algorithm that is used to predict the probability of a categorical dependent variable. Here the dependent variable is a binary variable coded as 1 (spam, malware) or 0 (ham, legit).
• Artificial neural network: for the ANN model we used the Keras library. We used a batch size of 32 for the whole dataset, set the number of epochs to 100, and replaced the terms “Spam & Ham” and “malware & legit” with ‘1 & 0.’ The ANN used for “SMS” has a total of 4 layers (one input layer, 2 hidden layers, and one output layer) and 31,393 trainable parameters. For “Malware,” the number of layers was 5 (one input layer, 3 hidden layers, and one output layer), with 1,057 trainable parameters.

Lastly, we compare the performance of the models based on the following measures: sensitivity and Cohen’s kappa (for the SMS dataset), F1-score (for the malware dataset), and accuracy for both datasets. To this end, we compute the predictions for the testing inputs and compare them to the actual testing outputs.

The sensitivity (for the SMS dataset) is the ratio of the number of samples correctly identified as positive to the number of samples that are actually positive. In other words, sensitivity measures the proportion of actual positive samples that the model detects, and it is given by:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

The F1-score (for the malware dataset) is used to compare ML algorithms. It is a statistical measure of the performance and quality of a model and is calculated as:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where Precision is the number of correct positive predictions relative to the total number of positive predictions, $TP / (TP + FP)$, and Recall is the number of correct positive predictions relative to the total number of actual positives, $TP / (TP + FN)$.

The confusion matrix has 4 cells, given by the sequence “TP, FP, FN, TN”. “TP” represents “true positive,” the number of positive samples that were accurately categorized. “FP” is the “false positive” value, that is, the number of negative samples classified as positive. “FN” is the “false negative” value, the number of actual positive samples classified as negative. “TN” represents the number of accurately classified negative samples. Thus, the confusion matrix allows us to visualize the performance of the classification models.

IV. RESULTS
The mobile botnet data were divided into two parts, SMS and malware, and each dataset was divided into two halves to compute quality measures for the first and second half. The average value of each measure was then taken to obtain more accurate results for the implemented algorithms (logistic regression, random forest, and artificial neural network). The program was developed on “Google Colab” using Python and specialized libraries, and we observed the following.
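The quality measures used in this paper (accuracy, sensitivity, F1-score, and Cohen's kappa) can all be derived from the four confusion-matrix cells. The sketch below is a minimal illustration: the function name, the zero-denominator guard, and the example counts are hypothetical and are not taken from the reported tables.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, sensitivity, precision, F1 and Cohen's kappa from TP, FP, FN, TN."""
    total = tp + fp + fn + tn
    accuracy = _ratio(tp + tn, total)
    sensitivity = _ratio(tp, tp + fn)  # also called recall
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    # Agreement expected by chance, from the row and column totals.
    expected = _ratio((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn), total ** 2)
    kappa = _ratio(accuracy - expected, 1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts chosen so the arithmetic is easy to follow by hand.
measures = classification_measures(tp=90, fp=10, fn=5, tn=95)
print(measures)
```

With these counts, accuracy is 185/200, sensitivity is 90/95, the chance agreement is 0.5, and kappa is therefore (0.925 - 0.5) / (1 - 0.5) = 0.85.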
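The two-halves procedure (train the same algorithm separately on each half, then average the quality measure) can be sketched with scikit-learn as follows. This is an illustration under assumptions: the helper `average_accuracy_over_halves` is hypothetical, the data are synthetic from `make_classification` rather than the SMS or malware datasets, and the ANN is left out so that the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def average_accuracy_over_halves(make_model, X, y, test_size=0.3, seed=0):
    """Train a fresh model on each half of (X, y) and average the test accuracies."""
    middle = (len(X) + 1) // 2
    halves = [(X[:middle], y[:middle]), (X[middle:], y[middle:])]
    scores = []
    for X_half, y_half in halves:
        X_train, X_test, y_train, y_test = train_test_split(
            X_half, y_half, test_size=test_size, random_state=seed, stratify=y_half
        )
        model = make_model().fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    return float(np.mean(scores))


# Synthetic stand-in data; rows are already shuffled by make_classification.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
results = {
    "logistic regression": average_accuracy_over_halves(
        lambda: LogisticRegression(max_iter=1000), X, y
    ),
    "random forest": average_accuracy_over_halves(
        lambda: RandomForestClassifier(n_estimators=50, random_state=0), X, y
    ),
}
print(results)
```

Passing a factory (`make_model`) rather than a fitted estimator ensures that each half gets its own untrained model, so the two scores being averaged are independent.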
A. SMS
Most results in previous studies were obtained by analyzing the complete, unbalanced dataset. We compare these with the results achieved by the methods we proposed after applying a data balancing method, since, as explained above, the class imbalance accounts for the discrepancy in the results. Therefore, the oversampling method was used, which increases the frequency of the minority category ‘spam’ until it equals that of the majority category ‘ham.’ After oversampling, each category contained 4,825 messages, and the total number of data points was 9,650. Using this method, we obtain a more complete and balanced dataset, which drastically improves the performance of the proposed models, as shown in Table II. Before balancing, the worst accuracy was 97.15% for the Random Forest algorithm and the best was 98.08% for ANN. The lowest sensitivity was 80.33% for the Random Forest algorithm, and the highest was 86.92% for ANN. Finally, Cohen's kappa reached its lowest value of 87.06% for the Random Forest algorithm and its highest value of 91.46% for the ANN algorithm. After balancing, the results became more accurate and stable: the accuracy of the ANN model reached 99.757% and that of Random Forest reached the highest value of 99.9%; the sensitivity of the ANN model reached 99.645% and that of Random Forest reached 100%; and Cohen's kappa had its lowest value of 99.51% for the ANN model and its highest value of 99.79% for the Random Forest algorithm.

TABLE II. SMS DATASET RESULTS
Dataset | Classifier | Accuracy | Sensitivity | Cohen's Kappa | TP | FP | FN | TN
Unbalanced (average result) | Logistic regression | 97.52% | 83.44% | 88.91% | 1453 | 0 | 33 | 186
Unbalanced (average result) | Random Forest | 97.15% | 80.33% | 87.06% | 1453 | 0 | 37 | 182
Unbalanced (average result) | Artificial Neural Network | 98.08% | 86.92% | 91.46% | 1452 | 1 | 29 | 190
Oversampling (average result) | Logistic regression | 99.83% | 99.925% | 99.653% | 1450 | 4 | 4 | 1437
Oversampling (average result) | Random Forest | 99.9% | 100% | 99.79% | 1453 | 1 | 0 | 1441
Oversampling (average result) | Artificial Neural Network | 99.757% | 99.645% | 99.51% | 1453 | 1 | 10 | 1431

B. Malware
In the preliminary tests, the dataset was used in full without preprocessing, and the best of the models focused heavily on the SVM algorithm. Results were similar to those explained in the literature review section. Before modifying the dataset, we noticed a dispersion in the logistic regression readings because this algorithm is usually used for binary classification. Also, training an ANN aims to steadily decrease the measured error between its output and the reference output. However, the features in the dataset are very large numeric values, which saturate the neuron outputs and prevent convergence toward acceptable models. Thus, the dataset was normalized, i.e., the large feature values were converted to values between 0 and 1. The same method was applied after the dataset was separated into two equal sub-datasets. The results of the models trained separately on the first and second half datasets, and the average values of the quality measures, are illustrated in Table I.

Before data normalization, the lowest accuracy on the test data was 70.06% for the logistic regression algorithm, and the best was 98.45% for Random Forest. The lowest accuracy on the training data was 70.06% for the logistic regression algorithm, and the highest was 98.27% for Random Forest. Finally, the F1-score was lowest for logistic regression, at 0%, while the ANN models reached 0.15%.
However, the result of the Random Forest algorithm was 97.34%. After normalization, the results became more accurate: on the test data, the lowest accuracy was 97.25% for logistic regression and the highest was 98.44% for ANN. Moreover, for the training data, the lowest accuracy was 97.33% for the logistic regression algorithm, and the highest was 98.48% for the ANN model. Finally, concerning the F1-score, the worst result was observed for the logistic regression algorithm, equal to 95.33%, and the best value was recorded for the ANN model, equal to 97.38%.

CONCLUSION
In this work, artificial intelligence-based methods were proposed for detecting botnet attacks. A preprocessing phase for the datasets was presented, consisting of splitting the data into two halves, normalizing it, and balancing it. ML methods, including ANN, random forest, and logistic regression, were tested for detecting SMS attacks and malware by classifying normal and harmful samples. The results of the experiments were ranked in order of preference based on how the models performed on SMS attacks. The ML methods achieved high results thanks to the oversampling of the data, compared with the results obtained before balancing. In theory, this approach can be used to identify a variety of attacks, bots, and other forms of unwanted network behavior better than previously created models.

ML was also used for malware detection. The results on the original dataset were ranked in order of preference, and the ranking changed after the data were shuffled and normalized: the ANN and logistic regression models obtained much better results, while the random forest classifier saw a slight decrease. The findings suggest that the dataset is handled better after normalization.

REFERENCES
[1] “5G estimated to reach 1.5 billion subscriptions in 2024 - Ericsson.” [Online]. Available: https://www.ericsson.com/en/press-releases/2018/11/5g-estimated-to-reach-1.5-billion-subscriptions-in-2024--ericsson-mobility-report.
[2] 5G-PPP Security WG, “5G-PPP Phase1 Security Landscape,” white paper, 2017.
[3] I. Ahmad, T. Kumar, M. Liyanage, J. Okwuibe, M. Ylianttila, and A. Gurtov, “Overview of 5G Security Challenges and Solutions,” IEEE Commun. Stand. Mag., vol. 2, no. 1, pp. 36–43, 2018, DOI: 10.1109/MCOMSTD.2018.1700063.
[4] O. Kupreev, J. Strohschneider, and A. Khalimonenko, Kaspersky DDOS Intelligence Report for Q3 2016, tech. report, SecureList, 31 Oct. 2016; https://securelist.com/kaspersky-ddos-intelligence-report-for-q3-2016/76464.
[5] G. Mantas, N. Komninos, J. Rodriguez, E. Logota, and H. Marques, “Security for 5G Communications,” Fundam. 5G Mob. Networks, pp. 207–220, 2015, DOI: 10.1002/9781118867464.ch9.
[6] SMS Spam Collection Data Set from UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
[7] SMS Spam Collection v.1, http://www.dt.fee.unicamp.br/~tiago/smsspamcollection
[8] K. Inc. Kaggle, Retrieved from https://www.kaggle.com/nsaravana/malwaredetection#Malware%20dataset.cs, 2019.
[9] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (4th ed.). Pearson, 2021.
[10] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and prospects,” Science, 349(6245), 255–260, 2015. https://doi.org/10.1126/science.aaa8415
[11] M. Rubin Julis and S. Alagesan, “Spam Detection In Sms Using Machine Learning through Textmining,” International Journal Of Scientific & Technology Research, Volume 9, Issue 02, February 2020.
[12] P. Navaney, G. Dubey, and A. Rana, “SMS Spam Filtering using Supervised Machine Learning Algorithms,” in 8th International Conference on Cloud Computing, Data Science & Engineering, 978-1-5386-1719-9/18, 2018 IEEE.
[13] N. Nur Amir Sjarif, N. F. Mohd Azmi, and Suriayati Chuprat, “SMS Spam Message Detection using Term Frequency-Inverse Document Frequency and Random Forest Algorithm,” in The Fifth Information Systems International Conference 2019, Procedia Computer Science 161 (2019) 509-515, ScienceDirect.
[14] T. Xia and Xuemin Chen, “A Discrete Hidden Markov Model for SMS Spam Detection,” in Applied Science, MDPI, Appl. Sci. 2020, 10, 5011; doi:10.3390/app10145011.
[15] S. Sheikhi, M. T. Kheirabadi, and A. Bazzazi, “An Effective Model for SMS Spam Detection Using Content-based Features and Neural Network,” International Journal of Engineering, IJE TRANSACTIONS B: Applications, Vol. 33, No. 2, (February 2020) 221-228.
[16] S. Bahtiyar, M. B. Yaman, and C. Y. Altıniğne, “A multi-dimensional machine learning approach to predict advanced malware,” Computer Networks, 160, 118–129, 2019. https://doi.org/10.1016/j.comnet.2019.06.015
[17] N. Milosevic, A. Dehghantanha, and K. K. R. Choo, “Machine learning aided Android malware classification,” Computers and Electrical Engineering, 61, 266–274, 2017. https://doi.org/10.1016/j.compeleceng.2017.02.013
[18] S. Hemalatha, “Mobile Malware Detection using Anomaly Based Machine Learning Classifier Techniques,” International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-8, Issue 11S2, September 2019.
[19] B. Cuan, A. Damien, C. Delaplace, and M. Valois, “Malware detection in PDF files using machine learning,” ICETE 2018 - Proceedings of the 15th International Joint Conference on e-Business and Telecommunications, 2, 412–419, 2018. https://doi.org/10.5220/0006884705780585.