
International Journal of Intelligent Systems and Applications in Engineering
ISSN: 2147-6799    www.ijisae.org    Original Research Paper

Machine Learning Approach for Malware Detection and Classification Using Malware Analysis Framework
D Anil Kumar 1, Susant Kumar Das2

Submitted: 27/10/2022 Revised: 18/12/2022 Accepted: 05/01/2023


Abstract: The world's ongoing digitalization is threatened by the daily appearance of new and increasingly complex malware. As a result, conventional signature-based approaches to malware detection have become practically useless. Recent research has demonstrated the effectiveness of machine-learning algorithms for malware identification. In this study, we propose a system that identifies and categorizes files of various types (exe, pdf, PHP, etc.) and API calls as benign or harmful using two levels of classifiers: a Macro level for malware detection and a Micro level for classifying malware files as Trojan, Spyware, Adware, and so on. Classification is one of the most widely used data mining (DM) techniques, and in this research we describe a DM classification technique for malware discovery. Based on the characteristics and behaviors of each malware sample, we propose several classification approaches for identifying malware. The malware traits are identified using dynamic analysis: our solution executes sample files in a virtual environment using Cuckoo Sandbox to generate static and dynamic analysis reports. In addition, a dedicated feature selection and extraction module, built on the data produced by Cuckoo Sandbox, operates on static, behavioral, and network analysis. Machine learning models are created using the Weka framework and the training datasets. Experimental results obtained with the proposed framework demonstrate high detection and classification rates across several machine learning algorithms.

Keywords: malware detection; API call; static and dynamic analysis; malware classification; behavior-based analysis.

1. Introduction

One of the biggest hazards on the Internet today is malicious software, or malware. Users download computer applications of many kinds, on a massive scale, from the Internet, and hackers use online black markets to distribute software that violates system security. Hackers therefore have a significant incentive to alter harmful code and make it more sophisticated, creating more uncertainty and reducing their chances of being discovered by anti-virus software. As a result, accessing the Internet is becoming ever riskier because of the growing danger of malware [1], which is distributed over the Internet in the form of files and software.

Malware is a harmful application used to violate a system's data availability, confidentiality, and integrity policies. Malware comes in a variety of forms, including viruses, Trojan horses, spyware, rootkits, and trapdoors, depending on how they endanger the system. The overall quantity of malware has increased dramatically since 2008, reaching more than 583 million in March 2020 according to AV-Test [2]. Given this increased incidence of malware, finding malicious files before they breach the system's security perimeter is crucial. According to the report, a malware detection system comprises the duties of malware analysis [3].

_________________________
1&2 Berhampur University, Odisha, India
ORCID ID: 0000-0003-3998-226X
* Corresponding Author Email: [email protected]

1.1. Malware Detection

Two well-known detection methods are in use: signature-based and behavior-based approaches. Signature-based methods, however, are unable to identify zero-day attacks, nor can they identify sophisticated new malware. On the other hand, with behavior-based methodologies it is highly challenging to describe properly the whole variety of acceptable behaviors that a system should exhibit. The majority of malware detection systems employ static methods such as signature-based and anomaly-based detection. While some systems attempt to discover irregularities in the code structure, others use signature matching to assess whether a program is malicious. Static approaches investigate malware programs without running them, in order to understand the code structure [4]. In dynamic analysis, malware is run in a virtual environment to track its network interactions and Windows API calls. These observed API-call data are used by dynamic malware detection techniques to identify harmful behavior. API-call information contains an executable's function names, arguments, and return values. From the occurrence and arrangement of API calls, dynamic approaches attempt to derive distinguishing characteristics that identify malware programs [5].

According to the malware-detection statistics of the AV-Test institute, there are more than 1 billion malware programs in circulation, spreading further every year. Since 2013 this growth has been exponential.

International Journal of Intelligent Systems and Applications in Engineering IJISAE, 2023, 11(1), 330–338 | 330
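As a toy illustration of how the occurrence and arrangement of API calls can be turned into distinguishing features, a sketch in plain Python follows; the call names are hypothetical assumptions, and this is not the paper's implementation.

```python
from collections import Counter

def api_ngram_features(calls, n=2):
    """Count n-grams over an API-call sequence: this captures not only
    which calls occur (occurrence) but the order they occur in (arrangement)."""
    grams = [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]
    return Counter(grams)

# Hypothetical trace of a dropper-like behavior pattern.
trace = ["CreateFileW", "WriteFile", "CreateProcessW", "WriteFile"]
features = api_ngram_features(trace)
```

A classifier can then be trained on such n-gram counts; larger n trades robustness for sensitivity to call reordering.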
Nearly 560,000 new malware samples are detected daily worldwide, and according to the latest statistics about 17 million new malware samples are reported every month. According to SonicWall data, more than 3.2 billion malware attacks occurred in the first half of 2020. According to Google statistics, nearly 7% of new websites are affected every year, and nearly 50 websites are reported to contain malware each week. According to Symantec, around 20 million IoT devices have been affected by malware, 75% of them through routers. According to Statista, China is the most malware-affected country globally at 47%, followed by Turkey at 42% and Taiwan at 39%. According to Check Point, viruses are spread through .exe files, and malware is generally delivered through email.

Various mobile-malware statistics have also been reported; a few of them are listed below. According to Kaspersky, the total number of mobile malware samples surpassed 28 million during the first half of 2020, and around 14 million are detected in each quarter every year. Adware is one of the most common kinds of mobile malware.

Because malware is becoming more prevalent, understanding how to guard against it is a crucial motivation for malware detection using machine learning techniques. In general, data mining techniques have been used to identify groups of malware in public software collections containing both harmful executables and innocuous software packages [6]. Typically, data mining algorithms fall into two types, based on supervised and unsupervised learning. Classification algorithms are the supervised learning techniques, which require training with a data set [7]. The unsupervised learning techniques, known as clustering algorithms, seek to analyze the organization of data into several clusters [8].

1.2. Malware Analysis

Malware analysis includes static, dynamic, and hybrid analyses. In response to the aforementioned limitations of existing techniques, we developed a malware analysis solution that uses a machine learning approach to distinguish between benign and malicious files. In this study we suggest an intelligent malware analysis methodology to identify and categorize malware effectively and efficiently.

Malware programs are often divided into categories including worms, viruses, Trojan horses, spyware, backdoors, and rootkits [9]. Signature-based techniques are the cornerstone of conventional approaches to malware identification. Frustrated by the failure of outdated methods to identify new or polymorphic dangerous files [10], researchers have recently tried to propose more trustworthy methodologies that identify malware by its behavior. Both static analysis and dynamic analysis have been used in the process of identifying and locating malware. Static analysis, a technique used in software-analysis approaches, may identify harmful code and place it in one of the available collections depending on various learning techniques. Static analysis uses binary code to identify harmful files and viruses; its biggest drawback is the absence of the program's source code, and extracting binary code is a difficult and intricate task.

Dynamic analysis, in contrast, detects dangerous code based on its runtime behavior [11]. Dynamic analysis, also referred to as behavioral analysis, observes behavior and system operation by examining the code at runtime [12]. The infected files are run on a virtual system during the dynamic analysis process [13]. To manage the expanding number and variety of malware, dynamic analysis can be employed in conjunction with classification and clustering techniques. Approaches for classifying malware help assign unidentified malware to known families; malware categorization is therefore employed to filter unknown instances, which lowers analysis costs.

The following are some of this paper's contributions:
• Putting forward a behavioral-analysis detection system.
• Introducing software that converts an XML file containing a malware behavior execution history into an appropriate WEKA input.
• Examining several categorization techniques using a virus case study.
• Comparing the experimental findings from the WEKA tool, such as the proportion of correctly classified instances and the positive accuracy ratio.
• Testing the optimal categorization approach, based on critical malware-detection criteria, toward creating a behavioral antivirus.

The overall arrangement of this article is as follows. Section 2 discusses some historical context and related research in virus detection and data mining approaches. The behavioral analysis of the malware is presented in Section 3; there, using a real-world case study, we provide a novel method for deciphering malware behavior and converting dangerous files into data mining files, and we describe the classification and prediction methods used with the data mining platform. Then, using the WEKA tool, we apply some well-known categorization techniques to our case study. Section 4 summarizes the assessment and experimental findings. Section 5 brings the work to a close.

2. Related Works

The background information and some associated efforts for malware detection with data mining approaches are covered in this part. First, we quickly go through data mining methodology based on malware and other system classification techniques. Researchers have recently revealed various malware analysis methodologies.
A data mining technique was put forward by Schultz et al. [14] to identify new dangerous files during runtime execution. Their approach was based on three distinct sorts of DLL calls: the binary's list of DLLs used, the list of DLL functions used, and the number of distinct system calls used inside each DLL. Additionally, they use signature techniques to check the byte ordering retrieved from an executable file's hex dump (a hexadecimal schema of computer data). The method's primary structure is based on the Naive Bayes (NB) algorithm, and the experimental findings were compared against conventional signature-based techniques.

Dynamic analysis approaches, which examine program activities while running in a secure environment, have been used in several studies. By examining a large number of malware mutation files, the technique of Jeffrey, N. et al. [15] derives typical patterns of malware programs. The dynamic malware detection technique of Amer et al. [16] is based on the frequency of API calls. By tracking API calls, Zou et al. [17] examine malware executables' API-call rates and sequences. In order to identify malware variants, Schofield et al. [18] provide an approach that builds representations of malware behavior by mapping API calls to colors. Function call patterns are used with a Hidden Markov Model to categorize malware, and sequence-alignment algorithms are combined with API call sequences. To increase the effectiveness of the detection algorithms, non-essential API functionalities are removed. Text-mining methods are applied to the API call sequences to analyze the operation, location, and parameter information of each API call and thereby determine the behavior of malware. Focusing on the examination of dangerous dynamic libraries loaded by portable executable files, Chaganti, R. et al. use both static and dynamic analysis, with a multi-view feature-fusion approach and a hybrid technique suggested in [19] to extract common properties of malware instances. Likewise using both static and dynamic analysis, the hybrid approach of Zhu et al. [20] extracts common aspects of malware instances; they make use of dynamic-link libraries and API-called functions. In order to spot malicious instances, Hasan et al. [21] analyze the frequency of system actions that portable executables trigger.

Cuckoo Sandbox has been utilized by researchers such as Sraw, J. S. et al. [22] for malware investigation. It is used to conduct tasks automatically, examine files, and gather thorough analytical data. These analyses retrieve API call traces, information on registry and file modification, network traffic logs, and particular log data of the malware's flow path within an isolated operating system.

Another effort concentrated on memory images to extract elements like registry activity, API calls, and imported libraries. It also evaluated the effectiveness of several machine learning methods and discovered that SVM (support vector machine) outperformed the others. Leveraging supplied arguments, Thakur et al. [23] built API calls with in-depth analysis; they also attempted to categorize and evaluate a large quantity of malware, employing a mix of characteristics in different studies to reach a high categorization rate [24]. We go into further depth about our suggested approach to malware analysis in the next section.

3. Malware Behavior Analysis

As seen in Table 1, we have considered malware datasets for malware behavioral analysis techniques; there are two malware datasets. Our approach uses a proposed program, written in VB.Net, to transform an XML file containing the execution history of malware actions into a non-sparse matrix [25]. The number of library-file calls targeted by a malicious program and their volume are the two components of turning each XML file into an appropriate WEKA input. For instance, in Box 1 the malware has called the library file ntdll.dll 16 times, in the range (0, 2). We next convert this matrix into the WEKA input data set [26]. Some classification algorithms precede the training techniques, and the classifiers with the greatest performance on the new virus data set are chosen for the test platform. Finally, a behavioral antivirus may be developed using this process [27]. We employ 10540 rows of files for our experiment; for each malware sample, the dataset has 57 attributes. Then, using our recommended application, we transform each XML file into a non-sparse matrix. Each entry of the non-sparse matrix holds two numbers, the first indicating the attribute number and the second the value of that attribute. The matrix's first row is displayed as follows:

(SystemSettings.DeviceEncryptionHandlers.dll|f226d16922369a8ea24e8156db40a373|34404|240|8226|14|12|102400|62464|0|98176|4096|0|6442450944|4096|512|10|0|10|0|10|0|180224|1024|202454|3|16736|262144|4096|1048576|4096|0|16|6|4.60129473839|2.71413309005|6.38175672795|27136.0|1536|102400|27197.3333333|1176|102000|38|174|6|4|1|3.44737833601|3.44737833601|3.44737833601|1076.0|1076|1076|256|16|1)

where '|' separates the parameters of one row of the file: the first field contains the name of the file, the second the hash code, and so on.

The decision-making history of the malicious program is examined last, in the WEKA platform. To execute malware safely on computer systems and stop it from spreading, programs such as a sandbox tool or a virtual machine may record a malware execution history [28]. The XML file contains useful information, including calls to system library files; file creation, search, and change operations; registry operations; information about primary processes; creation of mutexes (which allow multiple program threads to share a single resource); alterations to virtual memory; email transmissions; and switch communications. The proposed software reads and stores all this data in a non-sparse matrix [29].

4. Malware Analysis Framework

We present our suggested process for identifying and categorizing a sample file in this section. The proposed method's operational logic is shown in the flow graph in Figure 1. Behavior monitoring, feature extraction, data collection, analysis, report handling, and detection and classification are the steps of this process. The following subsections offer a full discussion of these phases.

Fig. 1. Flow of operation for malware classification

The Guests are the isolated environments in which the malware samples are securely executed and studied during the whole analysis process. The analysis phase comprises two subphases: (i) sandbox configuration and (ii) malware analysis lab set-up. The following is a detailed discussion of these subphases.

4.1. Sandbox Configuration

Configuring Cuckoo Sandbox [30] is crucial for obtaining malware behavior reports and making sure malware samples operate correctly, including all of their capabilities. In the real world, many malware samples exploit various flaws that may exist in certain software products, so it is crucial to include a variety of services in the virtual machines that the sandbox creates. VirtualBox serves as the hypervisor for the virtual machines utilized by Cuckoo. One Intel Core i5 2.13 GHz CPU, 8 GB of RAM, and an internet connection make up a virtual machine's specification. Adobe PDF Reader, Python, and Windows 10 (64-bit) are the programs installed on the virtual system.

4.2. Malware Analysis Lab Set-up

A malware analysis environment was developed. The required Cuckoo agent is installed in the guest machine's start-up menu, and the host computer has the Cuckoo host installed. Cuckoo's host configuration is set up in accordance with the virtual machine that will be utilized to run the sample. While the NAT adapter is used to connect the Cuckoo guest (XP virtual machine) to the internet, the VirtualBox host-only adapter (vboxnet0) is used to connect the virtual machine and the Cuckoo host. A snapshot is kept of the virtual machine's initial state, which is a malware-free and unharmed condition. The Python script cuckoo.py (Cuckoo host) is run with root privileges to begin analysis on any file. Once the Cuckoo host is started, we may send files to a virtual computer for examination in accordance with the Cuckoo parameters. When a sample is submitted, Cuckoo Sandbox executes the file in the virtual machine's clean state, monitors every activity taking place in the virtual environment, and creates a report for each sample [31-32]. The web interface and API calls may both be used to get the Analyzed Report ("AR"). To create the Macro and Micro datasets, the Report Handler is used to retrieve the AR, as described in the next part.

5. Experimental Results and Discussion

We used the WEKA tool in this part to put our strategy into practice. For the categorization techniques, we utilize a PC with an Intel Core i5 2.13 GHz CPU and 8 GB of RAM. Several classification techniques, including the K-Neighbors, XGB, Random Forest, and Light GBM approaches, were used for this investigation. We compared how well the various categorization techniques performed on two malware data sets.

For the suggested classification techniques, Table 1 details the analysis of statistical Data Sets 1 and 2. The elements that make up the classification results include correctly classified instances and incorrectly classified instances. Through this comparison, we are able to determine which classification algorithm detects malware best. As an illustration, there are 5281 malicious programs and 5259 benign programs.

Our datasets contain a total of 10540 samples, of which 5281 are harmful and 5259 are benign. Using a daily downloading routine, the MalShare website is used to download the infected samples [27]. Then, using VirusTotal [30], each sample is verified and stored according to its date. To be included in our dataset, a sample must be flagged by at least five antivirus engines. As previously mentioned, malware samples of the same sort have comparable characteristics and actions. It is difficult to determine a malware sample's ground-truth type label, as different anti-virus providers may assign several detection labels (types) to the same scanned sample.

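The inclusion rule just described (a sample is kept only if at least five antivirus engines flag it) can be sketched as follows. This is an illustrative sketch, not the authors' code; the report dictionaries and field names mimic the shape of a VirusTotal-style summary and are assumptions.

```python
def filter_confirmed_malware(reports, min_engines=5):
    """Keep only samples flagged by at least `min_engines` antivirus engines.
    `reports` is a list of dicts shaped like a VirusTotal-style summary,
    e.g. {"sha256": "...", "positives": 12}. Returns the kept hashes."""
    return [r["sha256"] for r in reports if r.get("positives", 0) >= min_engines]

# Hypothetical example reports (hashes shortened for readability):
reports = [
    {"sha256": "aa11", "positives": 12},  # flagged by 12 engines -> keep
    {"sha256": "bb22", "positives": 3},   # only 3 engines -> reject
    {"sha256": "cc33", "positives": 5},   # exactly at the threshold -> keep
]
confirmed = filter_confirmed_malware(reports)
```

Requiring multiple engine detections trades a little recall for a much cleaner ground-truth malicious set.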
TABLE 1: Dataset description

Type        Sample                  No. of samples   %
Malicious   Adware                  135              1.28083491
            Backdoor                132              1.25237192
            Hack Tool               13               0.12333966
            PUP                     21               0.19924099
            Ransom                  221              2.09677419
            Riskware                7                0.06641366
            Spyware                 241              2.28652751
            Trojan                  4302             40.8159393
            Virus                   74               0.70208729
            Worm                    135              1.28083491
Benign      APIMDS                  142              1.34724858
            CNET                    153              1.4516129
            CYGWIN                  2864             27.1726755
            DLL files               568              5.38899431
            File Hippo              27               0.25616698
            Portable applications   263              2.49525617
            WINDOWS 10              996              9.44971537
            Windows executable      246              2.33396584
Total                               10540            -

Researchers have thus started investigating other malware-sample tagging methods. For instance, in [23] Thakur, D. et al. use the open-source automated program AVClass to identify the type of malware in a sample, together with a confidence level determined from the level of anti-virus detections in VirusTotal and the engine-level agreement. Ground-truth labeling is outside the purview of this research; for our malicious dataset, each sample's malware type is identified according to the data provided by the Malwarebytes engine in VirusTotal. Based on VirusTotal data, the age of our harmful samples is in the range of April 2020 to June 2021. Table 2 lists the different malware categories and the number of samples for each category.

The benign samples come from a total of eight sources. We downloaded the APIMDS dataset after installing a fresh copy of Windows 10 and extracting the Windows executables and DLL files from the c:\windows\system32 directory. (1) In order to test legal downloading, we used free websites. (2) From the FileHippo website, we downloaded the top 43 programs and the top 300 portable Windows apps. (3) We extracted the two folders, CYGWIN and WINDOWS 10 benign samples, from the benign dataset of downloaded files; Windows executable files are included in both directories and were copied from the required author sources. Using VirusTotal, each and every benign sample from the eight sources has been confirmed. Table 1 lists the number of samples from each benign source.

Among the four algorithms, the MLP and MLR outperform the SVR and SLR in slope stability prediction. The MLP has an accuracy parameter, Kappa value, and AUC of 90.89%, 0.799, and 0.908, which are considered to be excellent prediction results.

5.1. Method Evaluation

In this part, we assess the effectiveness of our suggested approaches at distinguishing malware from harmless samples and then categorizing them into the appropriate classes. For training and testing, the benign dataset is divided into 4207 (80 percent) and 1052 (20 percent) samples, respectively. In addition, the training and testing datasets are separated for each malware-type dataset, with each sample set containing 80% and 20% of the samples, respectively, to ensure a fair evaluation. As a result, we have a total of 4225 malware samples for training and 1056 malware samples for testing.

Fig. 2. Experimental Result of different Classifiers

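The evaluation protocol of this section, an 80/20 split scored with the standard classification metrics, can be sketched in plain Python. This is an illustrative sketch, not the paper's WEKA pipeline, and the sample labels used below are hypothetical.

```python
import random

def split_80_20(samples, seed=42):
    """Shuffle and split a sample list into 80% train / 20% test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

def metrics(y_true, y_pred, positive="malicious"):
    """Precision, recall, F1, and accuracy from raw confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# With the benign set of 5259 samples, an 80/20 split yields
# 4207 training and 1052 testing samples, matching Section 5.1.
train, test = split_80_20(list(range(5259)))
```

The same split routine is applied per malware-type dataset, so every class contributes 80% of its samples to training and 20% to testing.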
5.2. Detection of Malicious Behavior

In this series of studies, we test how well Methods 1 and 2 distinguish harmful samples from benign ones. To achieve that, in Method 1 we trained our models on the malicious and benign training datasets (4225 and 4207 samples, respectively) using the four distinct machine-learning approaches mentioned above in Table 2. Using these machine-learning techniques, we perform 10-fold cross-validation on the datasets to avoid overfitting: the total dataset was divided into 10 parts, and the same model was run ten times on the same dataset with a different test set each time. The models are then tested on the 1056 harmful and 1052 benign testing samples, respectively. In Method 2, by contrast, we used three different ensemble techniques on the above-mentioned datasets: XGB + Random Forest Classifier, Light GBM + Random Forest Classifier, and XGB + Light GBM Classifier. In each case, the accuracy and other statistical criteria are analyzed. The effectiveness of the suggested techniques is assessed using the conventional machine-learning performance criteria listed below:
• TP (True Positives): the % of data samples that are actually positive and also predicted as positive.
• FP (False Positives): the % of data samples that are actually negative but wrongly predicted as positive.
• TN (True Negatives): the % of samples that are actually negative and predicted as negative.
• FN (False Negatives): the % of samples that are actually positive but wrongly predicted as negative.
• Recall: the proportion of actual positives that are predicted as positive, i.e. the TP rate (also known as sensitivity).
• Precision: the percentage of positive predictions that are actually positive.
• Accuracy: the ratio of samples accurately predicted (TP + TN) to all samples collected for testing (TP + TN + FP + FN).
• F-Measure: the harmonic mean of recall and precision.

Here we have considered two sets of samples, one malicious and the other benign; hence "positive" refers to the actual malicious or benign sample type and "negative" refers to a sample that is not of that type.

TABLE 2: EXPERIMENTAL RESULT OF DIFFERENT CLASSIFIERS

Classifiers              Dataset     Precision   Recall   F1 score   Accuracy
KNeighbors Classifier    Malicious   0.98        0.97     0.98       0.967742
                         Benign      0.94        0.96     0.95       0.972231
XGB Classifier           Malicious   0.98        0.99     0.98       0.978178
                         Benign      0.98        0.95     0.97       0.981105
Random Forest            Malicious   0.98        0.99     0.99       0.981973
                         Benign      0.98        0.96     0.97       0.979932
Light GBM Classifier     Malicious   0.99        0.99     0.99       0.982314
                         Benign      0.98        0.97     0.97       0.980115

Performance results for Methods 1 and 2 are shown in Table 2 and Table 3, respectively. According to Table 2, under Method 1 XGBoost outperforms the other three machine-learning algorithms in accuracy on the benign dataset, scoring 98.1105, and Light GBM outperforms the other three in accuracy on the malicious dataset, scoring 98.2314. The detail of Method 1 is shown in Figure 2. According to Table 3, under Method 2 the XGB + Light GBM classifier outperforms the other ensembles in accuracy on the benign dataset, scoring 98.4325, and the Light GBM + Random Forest classifier outperforms the others in accuracy on the malicious dataset, scoring 98.5312. As a result, we carry out further Light GBM + Random Forest trials employing 10-fold cross-validation. The evaluation details are shown in Figure 3. The performance outcomes of Methods 1 and 2 are then assessed in terms of various parameters.

Fig. 3. Ensemble-Based Voting Classifier

TABLE 3: RESULT OF ENSEMBLE-BASED VOTING CLASSIFIER

Classifiers                            Dataset     Precision   Recall   F1 score   Accuracy
XGB + Random Forest Classifier         Malicious   0.98        0.99     0.99       0.981205
                                       Benign      0.98        0.97     0.97       0.982305
Light GBM + Random Forest Classifier   Malicious   0.98        0.99     0.99       0.985312
                                       Benign      0.98        0.96     0.97       0.981201
XGB + Light GBM Classifier             Malicious   0.98        0.99     0.99       0.982114
                                       Benign      0.98        0.97     0.97       0.984325

5.3. Performance of the Methods and Misclassifications

In this part, we examine the performance of Methods 1 and 2 and explain when Method 2 might perform better than Method 1. As previously said, all approaches accomplish the same goal; the tokenization methods are
what really set them apart from one another. While Method 2 regards each API call's argument as a unique feature, Method 1 interprets the complete collection of arguments for each API call as a single token. If a sample has called only a few API calls but many arguments have been provided for each call, Method 2 performs better than Method 1, since the many arguments per call can make up for the few total API calls. Table 3 demonstrates that this finding is correct: Method 2 marginally outperforms Method 1 in terms of malware detection. The causes of the misclassifications in our suggested techniques are discussed next.

5.4. Classification of Malware Types

The purpose of this collection of experiments is to assess how well Methods 1 and 2 perform in categorizing malware samples into their appropriate classes. The malicious samples are divided into their categories using the same features that were used to divide the samples into harmful and benign classifications. Our harmful samples fit into one of ten malware categories, as was already explained (Table 1). Each malware-type dataset is divided into training and testing datasets, which include 80% and 20% of the samples, respectively, in order to provide a thorough validation and ensure that the machine-learning modules are trained on a suitable number of samples of each type. As a result, we have 2108 samples for testing and 8432 pieces of malware and benign code for training. The same five machine-learning techniques are employed to train models on the training dataset. Tables 2 and 3 respectively present the performance results for Methods 1 and 2. Utilizing Method 1, XGBoost outperforms the other machine-learning algorithms in terms of accuracy on the benign dataset with a score of 98.1105, and Light GBM outperforms the others in terms of accuracy on the malicious dataset with a score of 98.2314, as shown in Table 2. Table 3 shows that, using Method 2, the XGB + Light GBM classifier performs best on the benign dataset with an accuracy of 98.4325, and the Light GBM + Random Forest classifier performs best on the malicious dataset with an accuracy of 98.5312. We therefore perform additional Light GBM + Random Forest experiments using 10-fold cross-validation. Several factors are responsible for the differing accuracies of the models used in Methods 1 and 2.

(2) All of the study's parameters are vulnerable to slope failure; therefore, determining slope stability using a single metric is useless. The variable δ is perhaps the most profound aspect of the MLR and MLP models, while slope geometry attributes are also critical. It should also be highlighted that neither of the supervised learning techniques is suitable for all kinds of slope scenarios, and none was sufficient to address the existing problem.

6. State-of-the-art of the different models

In this section, we contrast our strategies with those of previous works that take API parameters into account. Our comparison considers (i) detection accuracy and (ii) the necessary API data and its limitations, including the frequency counter for a specific API request, recognizing API sequence trends, and more. The API parameters have been utilized in the research listed below to create malware detection and/or type categorization models. Both [10] and [11] employ pattern recognition algorithms to identify a shared sequence of API calls and parameters, as was mentioned in Section 2. However, such a pattern may be changed by removing and/or introducing certain API calls. In contrast, [1, 5, 6] employ malware detection methods based on the frequency of API calls. In [24], the distinction between benign and malicious samples was made using the frequency metric of calling particular API calls and their parameters. For malware identification, Yong et al. employed frequent item sets of API calls and their parameters in [14]. Statistics pertaining to API calls and their parameters, including the frequency, mean, and size of parameter arguments, were proposed by Hasan et al. [21] to identify harmful software activity. By removing and/or adding API calls and by changing the frequency counter values, malware developers can easily get around the aforementioned frequency-based techniques.

Compared to the previous research, our methodologies are distinct: (i) Because we do not rely on the sequence or pattern of the API calls, our methods are resistant to malware mutation and obfuscation tactics (such as changing the order of API calls or repeatedly using certain API calls and/or arguments); instead, our approaches take into account only the frequency of API calls and the values passed in those calls. (ii) Our method does not consider statistical traits such as the mean, frequency, or size of the API parameters. (iii) Because our approaches employ unique feature-generation functions to improve the retrieved API-based characteristics for better processing, domain knowledge of the complicated arguments is not necessary. (iv) None of the current methods have investigated the potential of using each API call's parameter element independently, as demonstrated in Method 2.

These benefits enable our method to overcome the scaling challenge posed by the high memory consumption and computational complexity associated with the use of a high-dimensional feature space. Table 3 provides a comparison of our strategy with the comparable research stated previously. As seen in Tables 2 and 3, our suggested approaches have outperformed the most recent methods.
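The two tokenization schemes contrasted above can be illustrated with a short sketch. The call-log format and token names below are hypothetical (the paper's actual Cuckoo-report parsing is not shown): Method 1 turns an API call plus its complete argument list into one token, Method 2 emits one token per argument, and only occurrence counts reach the classifier.

```python
from collections import Counter

# Hypothetical call-log format: (api_name, [argument values]).
sample = [("RegOpenKeyEx", ["HKLM", "Run"]),
          ("RegOpenKeyEx", ["HKCU", "Run"]),
          ("CreateFile", ["C:\\evil.exe", "WRITE"])]

def tokenize_method1(api_calls):
    # Method 1: an API call with its full argument list is ONE token.
    return [f"{name}({','.join(args)})" for name, args in api_calls]

def tokenize_method2(api_calls):
    # Method 2: the API name and every single argument are separate tokens,
    # so argument-rich calls contribute many features.
    tokens = []
    for name, args in api_calls:
        tokens.append(name)
        tokens.extend(f"{name}.arg={a}" for a in args)
    return tokens

def frequency_vector(tokens, vocabulary):
    # Only occurrence counts are kept: reordering the API calls cannot
    # change the vector, the mutation-resistance property claimed above.
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

m1 = tokenize_method1(sample)   # 3 coarse tokens
m2 = tokenize_method2(sample)   # 9 fine-grained tokens
```

Feeding such frequency vectors over a fixed vocabulary into the classifiers discussed above (XGBoost, Light GBM, Random Forest) reproduces the general shape of the pipeline; the token formats here are illustrative only.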
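The evaluation metrics defined in Section 5 reduce to simple arithmetic on the confusion-matrix counts; a minimal sketch with illustrative counts (not the paper's data):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the Section 5 metrics from raw confusion-matrix counts."""
    precision = tp / (tp + fp)             # positive predictions that are truly positive
    recall = tp / (tp + fn)                # TP rate, also known as sensitivity
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}

# Hypothetical counts: 99 malicious caught, 1 missed,
# 2 benign flagged, 98 benign passed.
metrics = classification_metrics(tp=99, tn=98, fp=2, fn=1)
```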
7. Conclusion and Future Work

This research presented a novel classification-based data mining method for identifying malware behavior. First, our proposed application is used to transform a malware behavior execution-history XML file into a non-sparse matrix. This matrix was then translated into a WEKA input data set. We used the WEKA tool to apply the suggested procedures to an actual case-study data set to demonstrate their effectiveness. We performed two sets of operations, Method 1 and Method 2, on the same data sets: classification techniques including the K-Nearest Neighbor, Random Forest, and Light GBM algorithms were used in Method 1, and a few ensemble techniques were used in Method 2. For malware detection, the regression-based classification approach performed best. Additionally, we used the ensemble classification approach to examine the same data set. The evaluation's findings showed that the suggested data mining and ensemble methods were more effective in finding malware. With reference to Figure 2 and Figure 3 and the experimental findings, classifying the behavioral characteristics of malware can be an easy way to create a behavioral antivirus. A genuine behavioral antivirus platform based on categorization via an ensemble algorithm will be developed and examined in future work.

Author contributions

All authors have equally contributed to this research work.

Conflicts of interest

The authors declare no conflicts of interest.

References

[1] Kumar, R., Alenezi, M., Ansari, M. T. J., Gupta, B. K., Agrawal, A., & Khan, R. A., "Evaluating the impact of malware analysis techniques for securing web applications through a decision-making framework under fuzzy environment", Int. J. Intell. Eng. Syst., 13(6), 94-109, 2020.
[2] Balaji, K. M., & Subbulakshmi, T., "Malware Analysis Using Classification and Clustering Algorithms", International Journal of e-Collaboration (IJeC), 18(1), 1-26, 2022.
[3] Akhtar, M. S., & Feng, T., "Malware Analysis and Detection Using Machine Learning Algorithms", Symmetry, 14(11), 2304, 2022.
[4] Hadiprakoso, R. B., Kabetta, H., & Buana, I. K. S., "Hybrid-based malware analysis for effective and efficient android malware detection", In 2020 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS), (pp. 8-12), IEEE, 2020.
[5] Hwang, C., Hwang, J., Kwak, J., & Lee, T., "Platform-independent malware analysis applicable to windows and Linux environments", Electronics, 9(5), 793, 2020.
[6] Bermejo Higuera, J., Abad Aramburu, C., Bermejo Higuera, J. R., Sicilia Urban, M. A., & Sicilia Montalvo, J. A., "Systematic approach to malware analysis (SAMA)", Applied Sciences, 10(4), 1360, 2020.
[7] Mehtab, A., Shahid, W. B., Yaqoob, T., Amjad, M. F., Abbas, H., Afzal, H., & Saqib, M. N., "AdDroid: rule-based machine learning framework for android malware analysis", Mobile Networks and Applications, 25(1), 180-192, 2020.
[8] Akhtar, M. S., & Feng, T., "Malware Analysis and Detection Using Machine Learning Algorithms", Symmetry, 14(11), 2304, 2022.
[9] Aboaoja, F. A., Zainal, A., Ghaleb, F. A., Al-rimy, B. A. S., Eisa, T. A. E., & Elnour, A. A. H., "Malware Detection Issues, Challenges, and Future Directions: A Survey", Applied Sciences, 12(17), 8482, 2022.
[10] Smith, M. R., Johnson, N. T., Ingram, J. B., Carbajal, A. J., Haus, B. I., Domschot, E., & Kegelmeyer, W. P., "Mind the gap: On bridging the semantic gap between machine learning and malware analysis", In Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security, (pp. 49-60), 2020.
[11] de Vicente Mohino, J. J., Bermejo-Higuera, J., Bermejo Higuera, J. R., Sicilia, J. A., Sánchez Rubio, M., & Martínez Herraiz, J. J., "MMALE: a methodology for malware analysis in linux environments", 2021.
[12] Pereberina, A., Kostyushko, A., & Tormasov, A., "An approach to dynamic malware analysis based on system and application code split", Journal of Computer Virology and Hacking Techniques, 1-11, 2022.
[13] Almomani, I., Ahmed, M., & El-Shafai, W., "Android malware analysis in a nutshell", PloS one, 17(7), e0270647, 2022.
[14] McDole, A., Gupta, M., Abdelsalam, M., Mittal, S., & Alazab, M., "Deep Learning Techniques for Behavioral Malware Analysis in Cloud IaaS", In: Stamp, M., Alazab, M., Shalaginov, A. (eds) Malware Analysis Using Artificial Intelligence and Deep Learning, Springer, Cham, (pp. 269-285), 2021.
[15] Jeffrey, N., Tan, Q., & Villar, J. R., "Anomaly Detection of Security Threats to Cyber-Physical Systems: A Study", In International Workshop on Soft Computing Models in Industrial and Environmental Applications, (pp. 3-12), Springer, Cham, 2023.
[16] Amer, E., Zelinka, I., & El-Sappagh, S., "A multi-perspective malware detection approach through behavioral fusion of API call sequence", Computers & Security, 110, 102449, 2021.
[17] Zou, D., Wu, Y., Yang, S., Chauhan, A., Yang, W., Zhong, J., ... & Jin, H., "IntDroid: Android malware detection based on API intimacy analysis", ACM Transactions on Software Engineering and Methodology (TOSEM), 30(3), 1-32, 2021.
[18] Schofield, M., Alicioglu, G., Binaco, R., Turner, P., Thatcher, C., Lam, A., & Sun, B., "Convolutional neural network for malware classification based on API call sequence", In Proceedings of the 8th International Conference on Artificial Intelligence and Applications (AIAP 2021), 2021.
[19] Chaganti, R., Ravi, V., & Pham, T. D., "A multi-view feature fusion approach for effective malware classification using Deep Learning", Journal of Information Security and Applications, 72, 103402, 2023.
[20] Zhu, H. J., Gu, W., Wang, L. M., Xu, Z. C., & Sheng, V. S., "Android malware detection based on multi-head squeeze-and-excitation residual network", Expert Systems with Applications, 212, 118705, 2023.
[21] Hasan, H., Ladani, B. T., & Zamani, B., "MEGDroid: A model-driven event generation framework for dynamic android malware analysis", Information and Software Technology, 135, 106569, 2021.
[22] Sraw, J. S., & Kumar, K., "Using static and dynamic malware features to perform malware ascription", ECS Transactions, 107(1), 3187, 2022.
[23] Thakur, D., Singh, J., Dhiman, G., Shabaz, M., & Gera, T., "Identifying major research areas and minor research themes of android malware analysis and detection field using LSA", Complexity, 2021.
[24] Al-Dwairi, M., Shatnawi, A. S., Al-Khaleel, O., & Al-Duwairi, B., "Ransomware-Resilient Self-Healing XML Documents", Future Internet, 14(4), 115, 2022.
[25] Rafiq, H., Aslam, N., Ahmed, U., & Lin, J. C. W., "Mitigating Malicious Adversaries Evasion Attacks in Industrial Internet of Things", IEEE Transactions on Industrial Informatics, 2022.
[26] Lebbie, M., Prabhu, S. R., & Agrawal, A. K., "Comparative Analysis of Dynamic Malware Analysis Tools", In Proceedings of the International Conference on Paradigms of Communication, Computing and Data Sciences, (pp. 359-368), Springer, Singapore, 2022.
[27] Kartel, A., Novikova, E., & Volosiuk, A., "Analysis of visualization techniques for malware detection", In 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus), (pp. 337-340), 2020.
[28] Liu, S., Feng, P., Wang, S., Sun, K., & Cao, J., "Enhancing malware analysis sandboxes with emulated user behavior", Computers & Security, 115, 102613, 2022.
[29] Yadav, C. S., Singh, J., Yadav, A., Pattanayak, H. S., Kumar, R., Khan, A. A., ... & Alharby, S., "Malware Analysis in IoT & Android Systems with Defensive Mechanism", Electronics, 11(15), 2354, 2022.
[30] Lebbie, M., Prabhu, S. R., & Agrawal, A. K., "Comparative Analysis of Dynamic Malware Analysis Tools", In Proceedings of the International Conference on Paradigms of Communication, Computing and Data Sciences, (pp. 359-368), Springer, Singapore, 2022.
[31] Palša, J., Ádám, N., Hurtuk, J., Chovancová, E., Madoš, B., Chovanec, M., & Kocan, S., "MLMD—A Malware-Detecting Antivirus Tool Based on the XGBoost Machine Learning Algorithm", Applied Sciences, 12(13), 6672, 2022.
[32] Louk, M. H. L., & Tama, B. A., "Tree-Based Classifier Ensembles for PE Malware Analysis: A Performance Revisit", Algorithms, 15(9), 332, 2022.

Biography

Mr. D Anil Kumar completed his M.Tech degree in Computer Science from Berhampur University, Odisha, in 2009 and is currently pursuing a Ph.D. from Berhampur University, Odisha. He has 18+ years of teaching experience and has worked in different organizations as an Assistant Professor. He has published a number of international journal and conference papers. His research areas are Cyber Security, Data Warehousing and Data Mining, and Computer Organization and Architecture.

Dr. Susanta Kumar Das joined the Dept. of Computer Science in 1993 and has 23 years of teaching experience in the department. He has attended a number of national and international conferences. To his credit, he has served as H.O.D. for 2 years in the department. At present he is the coordinator of the M.Tech (S.F.) course and coordinator of the spoken tutorial project conducted by IIT Bombay and funded by MHRD, Govt. of India. Fourteen scholars have been awarded Ph.D. degrees under his guidance, and one D.Sc. degree in Computer Science has been awarded under his guidance. He was felicitated with an award of honour by the Dept. of Mathematics, Maharshi Dayanand University, Rohtak, Haryana, at the international conference on History & Development of Mathematical Science & Symposium on Nonlinear Analysis. His research areas are Software Engineering and Network Security.