A Comprehensive Survey On Identification of Malware Types and Malware Classification Using Machine Learning Techniques
Nagababu Pachhala1, S. Jothilakshmi2, Bhanu Prakash Battula3
1 Research Scholar, Department of IT, Annamalai University, Annamalainagar, 608002, Tamil Nadu, India
2 Associate Professor, Department of IT, Annamalai University, Annamalainagar, 608002, Tamil Nadu, India
3 Professor, Department of CSE, KKR & KSR Institute of Technology & Sciences, Guntur-522017, Andhra Pradesh, India
1 [email protected] 2 [email protected] 3 [email protected]
Abstract
Malware is malicious code that affects a user or device and allows an attacker to do significant harm to the machine. It grows in volume and severity with each passing day and poses a major danger to the security of the Internet. Security experts and malware authors are locked in a never-ending fight, with the sophistication of malware increasing at the same rate as technological advancement. Current state-of-the-art research focuses on developing and applying machine learning methods for malware detection, because these techniques can keep pace with malware evolution and with the speed of technological advancement. The purpose of this study is to provide a systematic and comprehensive review of machine learning methods for malware detection, with a special emphasis on deep learning techniques, in order to aid in the identification of malware. The paper's primary contributions are that (i) it provides a comprehensive description of the methods and features used in a traditional machine learning workflow for malware detection and classification; (ii) it examines the challenges and limitations of traditional machine learning; (iii) it examines recent trends and progress in the field, with a particular emphasis on deep learning approaches; (iv) it addresses the research problems and unresolved obstacles associated with state-of-the-art methods; and (v) it discusses future directions of study in the field. The survey gives researchers a better understanding of malware detection and of the new advances and research paths that the scientific community is exploring to combat the problem.

Keywords: Malicious Software, System Damage, Antivirus Software, Malware Types, Malware Detection, Static Analysis, Dynamic Analysis, Machine Learning.

I. INTRODUCTION
Malware is the defining word for harmful software. It is dangerous code that affects the user or the computer and lets an attacker damage the machine. Malware comes in many forms, including viruses, Trojans, rootkits, ransomware, worms, botnets, spyware, adware, and keyloggers, and an extensive array of their families spreads online every day. According to a study conducted by the AV-Test Institute, 350,000 new hazardous applications and programmes are reported every day. The statistics record 897 million malicious programs registered in 2020, each of which is categorised and stored.

Today, the world is evolving into a digital age [1] in which cyber technology is an integral part of everyday life. Computers and the Internet are used for computing, for access to knowledge, and for technologies such as the Internet of Things (IoT) and cryptocurrency. Today's world speaks of a digital economy for cyber collections [2]; such deep computer involvement, together with many other innovations, presents the digital world with new challenges.

Malicious software, also known as malware, is a program intended to target computer systems for purposes such as device hijacking, file deletion, theft, spamming, and the download of further malware. The list of malicious activities is long and grows rapidly as new entries appear.

With the tremendous growth rate of cybercrime, it is clearly unreasonable to study and understand the enormous volume of malware [3] [4] manually. Analysts are aided by the fact that very little malware is written from scratch: developers primarily reuse code and code patterns to create new malware. The biggest challenge for analysts is to recognise the patterns and similarities inherited by the malware
in question. To take advantage of the similarities and expected patterns in malware, the anti-malware industry has begun to apply machine learning, in which machines are trained to discover and recognise the inherited patterns. Machine learning and malware detection are distinct fields with several overlaps.

As the Internet grows quickly, malware is now one of the biggest cyber hazards. Any program that performs malicious actions such as information stealing or snooping may be referred to as malware. Kaspersky Lab [5] described malware as a "computer program designed to infect and multiply harm to a legitimate user's computer." Although the diversity of malware is rising, anti-virus scanners cannot meet the security needs, which leads to millions of hosts being attacked. According to Kaspersky Lab, 12,989,287 hosts were targeted in 2019 and separate malware items were found. Juniper Research (2016) forecast that the worldwide cost of data breaches would rise to $3.7 billion in 2020.

Furthermore, because attack resources are highly available on the Internet, the degree of competence necessary for malware creation is decreasing. Advanced anti-detection techniques and the ability to purchase malware on the black market give anyone a chance to become an attacker [6], regardless of skill level. Current studies have shown that more and more attacks are produced by script kiddies or are automated. Therefore, protecting the computer systems of individual users and companies against malware is one of the most critical cyber security tasks, since even a single attack may compromise data and cause significant loss. Massive losses and repeated attacks dictate the need for reliable and prompt detection methods [7]. Current static and dynamic processes do not provide efficient detection, especially when dealing with zero-day attacks. Machine learning-based methods can also be used. Figure 1 shows several malware detection approaches.

[...] malware structure remains the same. In contrast, the latter changes the malware's form while maintaining its activity, resulting in a new form after each iteration. This complex property of malware is hard to detect and isolate. Signature-based, heuristic-based, normalization, and machine learning techniques are the most effective for malware detection. Machine learning has become a well-known solution for malware defenders in recent years.

II. TYPES OF MALWARE AND DIFFERENT MALWARE ANALYSIS MODELS
2.1 Malware types

It is helpful to classify the problem so that malware methodologies and reasoning are better understood. Depending on its function, malware may be divided into different groups. The types of malware are:

Virus: This is the simplest type of malware. It is any piece of software that is loaded and launched without the user's permission and that replicates itself or modifies other software.

Worm: This form of malware is very much like a virus. The difference is that a worm can propagate to other machines across the network.

Trojan: This class describes malware that appears to be legitimate software. Social engineering is therefore the usual propagation vector for this class: people are made to believe that they are downloading legitimate applications.

Adware: The only aim of this type of malware is to display advertisements on the computer. Adware may also be viewed as a subset of spyware, and its aim is to create revenue for its developers.

Spyware: As the name suggests, malware that performs espionage can be called spyware. Typical spyware practices include monitoring the search history to deliver customised advertising and tracking activities in order to sell the data to third parties afterwards.
[...] passwords, bank card numbers, and other sensitive data.

Ransomware: This malware is meant to encrypt all the data on the computer and asks the victim to transfer a certain amount of money to get the decryption key. Typically, a ransomware-infected computer is "frozen": the user cannot access any file, and the screen is used to display the attackers' demands.

2.2 Different malware analysis models

There are three types of malware analysis: static, dynamic, and memory malware analysis, represented in Figure 2.

[Figure 2: Malware Analysis]

[...] delete the malicious code's dynamic features, including CWSandbox, Anubis, CAT, TRACKTRAK, etc.

Memory Malware Analysis
The procedure of examining the malicious code [9] [10] after it has been executed is known as memory malware analysis. Memory analysis features include shared resources, application programs, hooking detection, network services, rootkit links, hidden objects, injected code, etc. Memory analysis tools include Volatility, Pin tools, Valgrind, etc.

This survey aims to review and systematize existing literature to promote malware analysis using machine learning techniques.

III. MACHINE LEARNING MODELS
[...] under different conditions, Random Forest, BayesNet, MLP and Support Vector Machine classifiers are used. The malware samples from the VirusTotal dataset are initially examined over 15 cross-validation cycles, and Random Forest achieves 96% precision.

AlAhmadi et al. [3] proposed a new technique for malware classification. It is a three-phase process. In the first phase, the network traffic of malware variants is taken as input and extracted. The variant families are built and encoded using the network changes, and the input instability is verified. In the second phase, sequences are extracted for each malware family and the flow values are compared using binary similarity, Levenshtein distance, cosine similarity, inter-flow distance, and N-flow mining. In the third phase, a model is trained on the extracted profile features. Machine learning methods such as KNN and Random Forest are used for classification. The proposed classifier is reported to achieve 95.5 percent accuracy.

A malware detection system was presented by Khan et al. [4]. The methodologies for malware detection are employed in remote and local analyses. With the help of signatures, a file is checked to determine whether it is malicious or benign. Various anti-virus tools are used to analyse malware and APIs during isolated inspection. The analysis includes anti-virtual-machine and anti-debugger checks, URL analysis, string analysis and packing. Ronen et al. [5] present the standard dataset that was released, with numerous malware samples, as a challenge for a Kaggle competition. Ye et al. [6] presented a survey of intelligent malware detection using data mining techniques. They describe two phases, feature extraction and classification, as the crucial processes in the analysis and detection of malware. They reviewed research activities from 2011 to 2016, including issues related to malware identification and data mining solutions.

Wang et al. [7] introduced the design and implementation of a sandbox, an extractor and a classifier. Three main steps are considered, performed by collectors, extractors and classifiers. The collector contains the PinFWSandbox module, which collects dynamic data and log file data and passes them to the extractor stage, as well as static analysis and passionate performance. The extractor extracts all static characteristics, as well as dynamic instructions and features. The classifier integrates all models, including the output of the individual model classifications, system-call output classifications, and instantaneous classification with dynamic outcomes.

Pai et al. [8] used clustering algorithms to classify malware. Static characteristics and their scores are derived from opcode sequences. These static characteristics are used to classify malware using the K-means, Expectation-Maximization, and Hidden Markov Model algorithms. Expectation-Maximization gives better accuracy than the other clustering algorithms. Makandar et al. [9] summarize malware analysis and detection technology for various malware types.

An automatic system for detecting unknown malware samples using neural networks is given by Kosmidis et al. [10]. The malware is classified with perceptron, decision tree, nearest centroid, stochastic gradient, multilayer perceptron, and random forest algorithms. For Random Forest, average accuracy and test time are also taken into account as parameters. Sahay et al. [11] grouped malware executables using optimal K-means clustering and used these groups as training features for detection. They concluded that the proposed approach provides 78 percent accuracy in finding unknown malware. Ahmadi et al. [12] used the Microsoft malware dataset and extracted features from hex dumps and disassembled files. XGBoost classification algorithms were used for classification. The authors reported an accuracy of 91.8 percent.

Drew et al. [13] performed polymorphic malware classification using the Super Threaded Reference-Free Alignment-Free N-sequence Decoder (STRAND). An ASM sequence model was presented in their method, and the accuracy obtained by cross-validation was more than 98.59 percent. Souri et al. [14] observe that traditional security protection measures are not sufficient given the volume and diversity of software variants. Dynamic analysis of malware at runtime may protect the system from malicious programs. This article proposed a framework to analyse malware behaviour [15] automatically using machine learning.

Hashemi et al. [16] proposed an entirely new Windows-based solution for polymorphic malware detection. Polymorphic viruses are much more advanced and harder to discover than their original versions, and catching them takes a lot of time. A two-stage approach is used to evaluate them: the first stage creates the API call sequences for both known and unknown malware; the second performs sequence restructuring and computes the distance between the two data points.

Souri et al. [17] proposed practical methods for constructing attack scenarios by correlating alerts through the preconditions and effects of intrusions. Their approach is based on the observation that the alerts in a sequence of attacks are not isolated but are linked as multiple phases, in which earlier stages prepare for later ones. They suggested a formal structure that represents alerts using the idea of hyper-alerts [18], together with their conditions and consequences.
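The hyper-alert idea described above can be illustrated with a minimal sketch. It assumes that each alert carries a set of conditions it requires and a set of consequences it makes true, and it links an earlier alert to a later one when a consequence of the former matches a condition of the latter. The names HyperAlert, prepares_for and correlate, the integer timestamps, and the alert and condition labels are all hypothetical and are not taken from [17] or [18].

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HyperAlert:
    """One alert plus what it needs and what it enables (hypothetical schema)."""
    name: str
    time: int  # illustrative timestamp; smaller means earlier
    prerequisites: frozenset = field(default_factory=frozenset)
    consequences: frozenset = field(default_factory=frozenset)


def prepares_for(earlier: HyperAlert, later: HyperAlert) -> bool:
    """True if `earlier` occurred first and at least one of its
    consequences satisfies a prerequisite of `later`."""
    return (earlier.time < later.time
            and bool(earlier.consequences & later.prerequisites))


def correlate(alerts):
    """Return (earlier, later) name pairs that chain alerts into a scenario."""
    ordered = sorted(alerts, key=lambda a: a.time)
    return [(a.name, b.name)
            for i, a in enumerate(ordered)
            for b in ordered[i + 1:]
            if prepares_for(a, b)]


# Assumed example data: three linked attack phases and one unrelated alert.
alerts = [
    HyperAlert("port_scan", 1, frozenset(), frozenset({"open_port_known"})),
    HyperAlert("exploit", 2, frozenset({"open_port_known"}), frozenset({"root_shell"})),
    HyperAlert("install_backdoor", 3, frozenset({"root_shell"}), frozenset({"persistence"})),
    HyperAlert("unrelated_login", 4, frozenset({"valid_password"}), frozenset()),
]
edges = correlate(alerts)
print(edges)
```

The resulting pairs form a directed graph in which isolated alerts, such as the unrelated login in the assumed data, receive no edges; a real system would also need to match the attributes of each condition (hosts, ports, time windows) rather than bare labels.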
Several malware detectors have to disassemble malicious code to generate assembly code for analysis. Palumbo et al. [19] described scenarios in which malware masks instructions to evade static analysis. They investigate static detection techniques by modelling the dynamic usage of the stack, which is used in metamorphic viruses. All the virus stages will be avoided if memory stacks are used in a more sophisticated way. Many viruses [20], for example, contain obfuscated call instructions to break static analysis.

Two common issues, concerning behavioural block identification strategies and the circumvention of monitoring points, were established by Narayanan et al. [21]. Both areas pose problems for every commercial AV solution today. The switch to the disc processor eliminates the difficulties of circumvention and allows partial solutions to the other issues. One respect in which signature detection differs from host-level detection is that behaviour detectors detect only malware and do not detect other anomalies.

Wu et al. [22] explored a smartphone malware detection model based on an artificial immune system, using both static and dynamic malware analysis. Using vector encoding, the static and dynamic features are distinguished and antigens are created. In addition, 34 malware samples and 25 benign files were collected as study samples.

Bat-Erdene et al. [23] introduced a technique to classify the packing algorithms of unknown packers. First, they measured the entropy of a given executable and converted the entropy values of a particular memory region into symbolic representations. They used symbolic aggregate approximation, which is considered viable for large data conversions. Second, the symbols are classified using supervised learning classification techniques, i.e., Naive Bayes and Support Vector Machines.

[...] implemented, and that has less background traffic. Therefore, a system is proposed for the detection and prevention of unknown malware attacks. The disadvantages indicated must be overcome by developing an efficient extraction model that enhances the accuracy of malware detection. To categorise malware and to prevent it, thereby providing system security, an extractor model must be built for the rule set.

VI. CONCLUSION
With malicious software an increasing security threat, malware detection continues to be an essential research issue. This paper surveyed current malware identification models that use machine learning techniques. Malware detection systems have been compared and evaluated on a number of critical aspects, including classification approach, analysis methodology, dataset size, precision and analysis. Research on malware detection has already shown that machine learning can classify malware correctly. It is hoped that more constructive learning methods will be developed with machine learning, ensemble learning and deep learning. These contributions can be linked to important fields of inquiry. New combinations of aims, features and algorithms can be investigated in order to increase accuracy beyond the existing state of the art. Moreover, as some classes of algorithms have never been used for some purposes, new directions can be offered for further research. An investigation of malware can provide other ideas to be pursued. The entire field of study focuses on the development of appropriate malware testing standards. This paper provides a brief survey of the models available for malware detection. The new idea of malware analysis economics can drive future research directions when establishing a malware testing environment in which appropriate tuning methods balance conflicting metrics and improve the security levels in the network.

REFERENCES
[5]. Ronen, R., Radu, M., Feuerstein, C., Yom-Tov, E., & Ahmadi, M. (2018). Microsoft Malware Classification Challenge. arXiv preprint arXiv:1802.10135.
[6]. Ye, Y., Li, T., Adjeroh, D., & Iyengar, S. S. (2017). A survey on malware detection using data mining techniques. ACM Computing Surveys (CSUR), 50(3), 41.
[7]. Wang, C., Ding, J., Guo, T., & Cui, B. (2017, November). A Malware Detection Method Based on Sandbox, Binary Instrumentation and Multidimensional Feature Extraction. In International Conference on Broadband and Wireless Computing, Communication and Applications (pp. 427-438). Springer, Cham.
[8]. Pai, S., Di Troia, F., Visaggio, C. A., Austin, T. H., & Stamp, M. (2017). Clustering for malware classification. Journal of Computer Virology and Hacking Techniques, 13(2), 95-107.
[9]. Makandar, A., & Patrot, A. (2017). Overview of malware analysis and detection. In IJCA proceedings on national conference on knowledge, innovation in technology and engineering, NCKITE (Vol. 1, pp. 35-40).
[10]. Kosmidis, K., & Kalloniatis, C. (2017, September). Machine Learning and Images for Malware Detection and Classification. In Proceedings of the 21st Pan-Hellenic Conference on Informatics (p. 5). ACM.
[11]. S. K. Sahay and A. Sharma, “Grouping the Executables to Detect Malwares with High Accuracy,” Procedia Computer Science, vol. 78, no. June, pp. 667–674, 2016.
[12]. M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, “Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification,” ACM Conference Data Application Security Priv., pp. 183–194, 2016.
[13]. J. Drew, M. Hahsler, and T. Moore, “Polymorphic malware detection using sequence classification methods and ensembles,” EURASIP J. Inf. Secur., vol. 2017, no. 1, p. 2, 2017.
[14]. Souri A, Norouzi M, Asghari P (2017) An analytical automated refinement approach for structural modeling large-scale codes using reverse engineering. Int J Inf Technol 9:329–333. https://doi.org/10.1007/s41870-017-0050-7
[15]. Souri A, Navimipour NJ, Rahmani AM (2017) Formal verification approaches and standards in the cloud computing: a comprehensive and systematic review. Comput Stand Interfaces. https://doi.org/10.1016/j.csi.2017.11.007
[16]. Hashemi H, Azmoodeh A, Hamzeh A, Hashemi S (2017) Graph embedding as a new approach for unknown malware detection. J Comput Virol Hacking Tech 13:153–166. https://doi.org/10.1007/s11416-016-0278-y
[17]. Souri A, Asghari P, Rezaei R (2017) Software as a service based CRM providers in the cloud computing: challenges and technical issues. J Serv Sci Res 9:219–237. https://doi.org/10.1007/s12927-017-0011-5
[18]. Chowdhury M, Rahman A, Islam R (2018) Malware analysis and detection using data mining and machine learning classification. In: Abawajy J, Choo K-KR, Islam R (eds) International conference on applications and techniques in cyber security and intelligence: applications and techniques in cyber security and intelligence. Springer International Publishing, Cham, pp 266–274
[19]. Palumbo P, Sayfullina L, Komashinskiy D, Eirola E, Karhunen J (2017) A pragmatic android malware detection procedure. Comput Secur 70:689–701. https://doi.org/10.1016/j.cose.2017.07.013
[20]. Mohamed GAN, Ithnin NB (2018) SBRT: API signature behaviour based representation technique for improving metamorphic malware detection. In: Saeed F, Gazem N, Patnaik S, Saed Balaid AS, Mohammed F (eds) Recent trends in information and communication technology. Proceedings of the 2nd international conference of reliable information and communication technology (IRICT 2017). Springer International Publishing, Cham, pp 767–777
[21]. Kumar, S. A., Babu, E. S., Nagaraju, C., & Gopi, A. P. (2015). An empirical critique of on-demand routing protocols against rushing attack in MANET. International Journal of Electrical and Computer Engineering, 5(5).
[22]. Narayanan A, Chandramohan M, Chen L, Liu Y (2017) A multi-view context-aware approach to Android malware detection and malicious code localization. Empir Softw Eng. https://doi.org/10.1007/s10664-017-9539-8
[23]. Wu B, Lu T, Zheng K, Zhang D, Lin X (2014) Smartphone malware detection model based on artificial immune system. China Commun 11:86–92.
[24]. Bat-Erdene M, Park H, Li H, Lee H, Choi MS (2017) Entropy analysis to classify unknown packing algorithms for malware detection. Int J Inf Secur 16(3):227–248.
[25]. Cui B, Jin H, Carullo G, Liu Z (2015) Service-oriented mobile malware detection system based on mining strategies. Pervasive Mob Comput 24:101–116.
[26]. Fan Y, Ye Y, Chen L (2016) Malicious sequential pattern mining for automatic malware detection. Expert Syst Appl 52:16–25. https://doi.org/10.1016/j.eswa.2016.01.002
[27]. Martín A, Menéndez HD, Camacho D (2016) MOCDroid: multi-objective evolutionary classifier for Android malware detection. Soft Comput 21:7405–7415.
[28]. Sarada, K., Narayana, V. L., Gopi, P., & Pavani, V. (2020). An iterative group based anomaly detection method for secure data communication in networks. Journal of Critical Reviews, 7(6), 208-212.
[29]. Gopi, A., Babu, E. S., Raju, C. N., & Kumar, S. A. (2015). Designing an Adversarial Model Against Reactive and Proactive Routing Protocols in MANETS: A Comparative Performance Study. International Journal of Electrical & Computer Engineering (2088-8708), 5(5).
[30]. Shakya, Subarna, Lalitpur Nepal Pulchowk, and S. Smys. “Anomalies Detection in Fog Computing Architectures Using Deep Learning.” Journal of Trends in Computer Science and Smart Technology, March 2020, no. 1 (2020): 46-55.