Windows Malware Detection Based on Cuckoo Sandbox Generated Report Using Machine Learning Algorithm
Shiva Darshan S.L.1 , Ajay Kumara M.A.2 , and Jaidhar C.D.3
Department of Information Technology
National Institute of Technology Karnataka
Surathkal, Mangalore, India
[email protected] , [email protected] , [email protected]

Abstract—Malicious software, or malware, has grown rapidly, and many anti-malware defensive solutions fail to detect unknown malware because most of them rely on signature-based techniques. A signature-based technique detects malware using a pre-defined signature and performs poorly when classifying unseen malware that can evade detection through code obfuscation. This growing evasion capability of new and unknown malware needs to be countered by analyzing the malware dynamically in a sandbox, which provides an isolated environment for observing its behavior. In this paper, the malware is executed in the Cuckoo sandbox to obtain its run-time behavior. At the end of the execution, the Cuckoo sandbox reports the system calls invoked by the malware. This report is in JSON format and has to be converted to MIST format to extract the system calls. The collected system calls are structured as N-Grams, and Information Gain (IG) is used as the feature selection technique to build the classifier. A comprehensive experiment was conducted to find the best-fit classifier among Bayesian-Logistic-Regression, SPegasos, IB1, Bagging, Part, and J48, as defined within the WEKA tool. In the experimental results, SPegasos gives the best overall performance for the top 200, 400, and 600 selected N-Grams, with the highest accuracy, the highest True Positive Rate (TPR), and the lowest False Positive Rate (FPR).

Keywords—Sandbox, Malware Detection, Machine Learning, Hypervisor, Virtual machine, N-Gram Feature Extraction.

I. INTRODUCTION

Malware, also known as malicious software, is code developed with the intention of damaging the function of a system. Malware can disrupt normal operation by infecting the system or network [1]. It enters a system either through multiple media or by being downloaded as a seemingly genuine application. Once inside, it checks for vulnerabilities and infects the system if the system is highly vulnerable. Generally, anti-malware defensive solutions are signature dependent and run inside the host machine. They are inadequate to thwart emerging advanced malware attacks.

Automated malware analysis frameworks (or sandboxes) [2] [3] are among the most recent security innovations used to detect malware based on behavior traits. Such frameworks allow unknown malware to execute in an isolated environment and monitor its run-time behavior. They have long been used as a major part of manual investigation and are increasingly used as a primary component of automated malware detection. The main advantage of automated malware detection is that it can recognize unseen malware on the basis of the activities observed during its execution. The majority of sandboxes observe the behavior of a user-mode process at the system call interface. System calls are routines that allow a user-level process to interact with the operating system to perform its desired task. These tasks include reading data from files, sending packets across the network, and reading entries from the registry. Looking deeper into the execution of a program, much more useful information can be gathered.

This paper presents an approach to malware detection that extracts only the system calls (i.e., the operation field) from Malware Instruction Set (MIST) reports, which are obtained by applying the MIST conversion process to the run-time behavioral reports produced by the Cuckoo sandbox. The extracted system calls are used to generate sequences of N-Grams of a specified length (N=2, N=3, and N=4), and the Information Gain (IG) feature selection method is then used to calculate a score for each N-Gram. The top N-Grams are selected based on the highest IG scores and are passed to the classifier for classification.

The rest of the paper is organized as follows. Section II covers the background of the MIST instruction and its representation. Section III reviews earlier research on detecting malicious executables. Section IV describes our proposed approach. Section V discusses the experimental results. Finally, Section VI draws the conclusion.

II. BACKGROUND

The prime task of a malware detection system is to identify known as well as unknown malware and to defend the integrity of the system while it performs its function. Malware analysis can be performed in two ways: code analysis and behavior analysis. Code analysis is generally performed statically by obtaining a complete overview of the software. A major limitation of code analysis is that it is often hindered by evasion techniques such as binary packers, polymorphism, and anti-debugging techniques. In behavior analysis, the behavior of the malware is monitored while it runs on a host system. Behavior-based malware analysis is an efficient way of observing the actions of the malware, and several existing monitoring tools provide a behavioral report [3]. Generally, behavior-based malware analysis tools execute a malware sample in an isolated environment and obtain accurate system-level behavior by monitoring and recording the system calls invoked by the malware. A summary of the observed behavior of the sample is tabulated in the analysis report. Monitoring suites such as Anubis and CWSandbox produce the behavior report in a textual or XML-based format that provides the system-level behavior of the malware, including system call details. A human analyst can easily read textual or XML-based formats, but they are unsuitable for further automatic analysis because they have a negative impact on the run time of the analysis. XML representations are inappropriate for finding generic behavioral patterns. Textual representations are also difficult to process because of aggregation, and they increase the size of the report. In contrast to textual and XML-based formats, MIST records all system-level behavior with the system call arguments organized in different levels of blocks (Fig. 1).

Fig. 1: MIST representation of system call.

The first field, category, denotes the type of system call, and the second field, operation, represents a particular system call. In each MIST instruction, the type and size of the argument blocks depend on the particular system call. The MIST representation is optimized for effective and efficient analysis of malware behavior using machine learning algorithms [4].

III. RELATED WORK

Several dynamic malware analysis approaches based on sandbox technology have been proposed in the literature. Willems et al. [5] developed an open source tool called CWSandbox that allows a malware sample to execute either in a native environment or in a virtual Windows environment. The API calls are monitored by the hook functions of the analysis component. DRAKVUF [6] is another dynamic malware analysis system that performs in-depth execution tracing of malware, including modern stealthy kernel rootkits, by intercepting the kernel heap allocations of the targeted system. In addition, DRAKVUF efficiently addresses the challenges related to the detection of system call interception in other sandbox systems [5]. On the other hand, virtualization-based sandbox techniques [2] [7] play a vital role by examining the operating system structures that are manipulated by new types and variants of malware.

Cuckoo [3] is another malware analysis system, which provides a detailed behavior report of a Windows executable file executed inside an isolated environment. Cuckoo can analyze many different malicious files (executables, document exploits, etc.) and malicious websites in a virtualized environment. Cuckoo can trace the API calls and general behavior of the input file and can easily be integrated into an existing framework. Current sandbox-based systems [8] [9] are sufficient for reporting the behavioral activity of an input executable file in the form of a behavioral report. However, accurate examination of malware based on the sandbox-generated report involves extensive manual analysis. In addition, the sandbox also produces reports for benign executable files on the monitored machine. In such cases, precisely distinguishing actual malware activities from those of benign applications is a challenging task. The sandbox report is unstructured, which makes it difficult to precisely extract the actual semantic information (e.g., system calls). Rieck et al. [4] attempted effective malware detection based on the invoked system call sequence. The collected system call sequence is structured in the form of N-Grams, and the N-Gram feature extraction technique is widely used for different input sources [10] [11] [12]. In another work, Tesauro et al. [13] applied N-Grams as features for malware detection. The N-Grams were selected from the most frequent classes in malware and benign files. N-Grams perform better when the experiment is carried out with a larger feature set. Recent reports have shown that feature selection based on IG produces the best results in distinguishing malicious executable files from benign executable files [10].

Machine learning algorithms are regarded as a promising technique for accurately distinguishing malicious executables from benign executable files. Kolter et al. [14] describe machine learning algorithms that classify malicious executables found in the wild by encoding N-Grams as features for classification. An automated behavior-based malware analysis framework using machine learning was proposed in [15]; it converts the report generated by the sandbox into MIST format to identify unknown malware with similar behavior.

In our work, we use the Cuckoo sandbox to gather the system-level behavior of the executable files. The system call sequences triggered by the executable files (processes) are extracted from the report generated by the Cuckoo sandbox. The IG feature selection technique is employed to choose the best features and construct the Final Feature Vector (FFV). A machine learning algorithm is then employed to distinguish malware executable files from benign executable files based on the FFV.
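
As a rough illustration of extracting a call sequence from a sandbox report, the following Python sketch reads the API names of each monitored process from a parsed JSON report. The key layout (behavior, processes, calls, api) is an assumption about a Cuckoo-style report and is not taken from this paper; the function name and the sample data are hypothetical. The sketch also reads the JSON directly, whereas the paper first converts the report to MIST and takes the operation field from there.

```python
import json


def extract_api_calls(report):
    """Map each monitored process to the ordered list of API names it invoked.

    Assumed layout: report["behavior"]["processes"][i]["calls"][j]["api"].
    """
    sequences = {}
    processes = report.get("behavior", {}).get("processes", [])
    for index, process in enumerate(processes):
        # Fall back to the list position when no "pid" field is present.
        key = process.get("pid", index)
        sequences[key] = [
            call["api"] for call in process.get("calls", []) if "api" in call
        ]
    return sequences


# Tiny hand-written report fragment (hypothetical data, same assumed layout).
sample_report = json.loads("""
{"behavior": {"processes": [
  {"pid": 1234, "process_name": "sample.exe",
   "calls": [{"category": "file", "api": "NtCreateFile"},
             {"category": "file", "api": "NtReadFile"},
             {"category": "registry", "api": "RegOpenKeyExW"}]}
]}}
""")

calls_by_process = extract_api_calls(sample_report)
print(calls_by_process)
```

Keeping one ordered list per process preserves the call order, which is what the later N-Gram step depends on.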

Authorized licensed use limited to: Penn State University. Downloaded on November 15,2020 at 02:14:45 UTC from IEEE Xplore. Restrictions apply.

IV. PROPOSED WORK

Our proposed work distinguishes malware files from benign files on the basis of system call sequences, which are structured using a heuristic method called N-Gram analysis. It adopts the IG technique to compute an IG score for each N-Gram and extracts the top N-Grams (features) with the highest IG scores in order to prepare the FFV needed for classification. Fig. 2 depicts the overall architecture of the proposed work.

Fig. 2: System Architecture of the proposed work.

A. Behavior analysis
Since the Cuckoo sandbox functions at the hypervisor as a separate entity, it examines the behavior of malware running on the VMs and produces the behavioral analysis report of the running executables in JavaScript Object Notation (JSON) format.

B. Conversion process
The analysis reports obtained in JSON format are pre-processed to obtain MIST, which is preferred because it uses a smaller file size and reduces processing time. Since our approach is specific to the monitored system calls, we are concerned only with the operation field (the system call, as shown in Fig. 1) of the MIST files, from which we generate N-Gram (4 bytes) files as shown in Fig. 3.

Fig. 3: Snippet of N-Gram extraction using MIST file.

To generate the N-Gram files, we follow these steps:
• System calls extraction,
• N-Gram generation,
• Sorting of N-Grams, and
• Duplicate removal

In the first step, system call extraction, we select only the operation field, i.e., the system calls, of all the benign MIST files (1, 2, ..., 10, 11, ..., n) and all the malware MIST files (1, 2, ..., 10, 11, ..., n), as shown in Fig. 4, since we have the record of all system-level behaviors. The extracted operation fields are stored in a text file and grouped in sequence to form N-Grams of variable length, i.e., N=2, N=3, N=4, etc. Longer N-Grams represent the characteristics better. A snippet of the extraction is shown in Fig. 3. In the second step, N-Gram generation, we have grouped N-Grams of length four bytes. In the third step, the formed N-Grams are sorted in descending order. In the fourth step, after sorting, any duplicates are removed to obtain unique N-Grams. The unique N-Grams can be employed for better feature selection and also provide better classification.

Fig. 4: System call extraction phase. (a) Steps to generate Benign N-Gram Files. (b) Steps to generate Malware N-Gram Files.

The above steps are a prerequisite for feature selection, which cannot be performed without N-Gram formation. The formed benign N-Gram files [B1, B2, B3, ..., Bn] and malware N-Gram files [M1, M2, M3, ..., Mn] undergo a union operation within each class, giving [B1 ∪ B2 ∪ B3 ∪ ... ∪ Bn] for the benign N-Gram files and [M1 ∪ M2 ∪ M3 ∪ ... ∪ Mn] for the malware N-Gram files. After the union operation, the benign union and the malware union are sorted in non-increasing order, and any duplicates are removed to obtain the unique benign N-Grams and the unique malware N-Grams. The occurrences of each unique benign N-Gram in the benign N-Gram files are observed and tabulated as the N-Gram frequency table for the benign class.
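
The N-Gram generation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: N is taken to be the number of consecutive system calls in a window, the frequency table counts the number of files of a class that contain each N-Gram, and the function names and call names are hypothetical.

```python
from collections import Counter


def make_ngrams(calls, n):
    """Slide a window of n consecutive system calls over one call sequence."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def unique_sorted_ngrams(calls, n):
    """Generate the N-Grams of one file, drop duplicates, sort in descending order."""
    return sorted(set(make_ngrams(calls, n)), reverse=True)


def frequency_table(files, n):
    """For every unique N-Gram of a class, count the files that contain it."""
    table = Counter()
    for calls in files:
        # A set is used so that each file contributes at most once per N-Gram.
        table.update(set(make_ngrams(calls, n)))
    return table


# Hypothetical call sequences standing in for the operation fields of MIST files.
benign_files = [
    ["open_file", "read_file", "close_file"],
    ["open_file", "read_file", "read_file", "close_file"],
]

print(unique_sorted_ngrams(benign_files[1], 2))
benign_table = frequency_table(benign_files, 2)
print(benign_table)
```

The keys of `benign_table` are the union of the unique N-Grams of all benign files, so the same function applied to the malware files gives the second frequency table.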


The occurrences of each unique malware N-Gram in the malware N-Gram files are observed and tabulated in the same way, giving the N-Gram frequency table for the malware class.

Fig. 5: N-Gram frequency table for benign class and malware class with feature contingency table.

The feature contingency table is then prepared from the values in the N-Gram frequency tables for the benign and malware categories, as depicted in Fig. 5. The feature contingency table is used to calculate Information Gain [10], which is computed by the following equation,

$$IG(N\text{-}Gram) = \sum_{v_{N\text{-}Gram} \in \{0,1\}} \; \sum_{C \in \{C_i\}} P(v_{N\text{-}Gram}, C) \, \log \frac{P(v_{N\text{-}Gram}, C)}{P(v_{N\text{-}Gram}) \, P(C)} \qquad (1)$$

where $C$ is one of the two categories (benign or malware) and $v_{N\text{-}Gram}$ is the value of the N-Gram: $v_{N\text{-}Gram} = 1$ indicates that the N-Gram is present in a benign or malware N-Gram file, and $v_{N\text{-}Gram} = 0$ otherwise. $P(v_{N\text{-}Gram}, C)$ is the proportion of N-Gram files in $C$ in which the N-Gram takes the value $v_{N\text{-}Gram}$. $P(v_{N\text{-}Gram})$ is the proportion of N-Gram files (benign or malware) in the entire training set in which the N-Gram takes the value $v_{N\text{-}Gram}$. $P(C)$ is the proportion of the data set belonging to category $C$. The N-Grams are arranged in non-increasing order of IG score, and the topmost L N-Grams are extracted as the best features for classification.

C. Instruction Converter
The instruction converter converts the extracted features into an ARFF (Attribute-Relation File Format) file. ARFF is an ASCII text file that describes a list of instances sharing a set of attributes. This step is important because the WEKA classifiers used in our approach work with ARFF files.

V. EXPERIMENT RESULTS

Our experimental data consist of 3000 benign MIST files and 3100 malware MIST files. The malware MIST files come from four families: Swizzor (1000), Basun (1000), AutoIt (1000), and Kelihos Trojan (100). The first three families were collected from a public source¹, and the remaining 100 malware MIST files were obtained by applying the MIST conversion process to the run-time behavioral reports produced by the Cuckoo sandbox after injecting the Kelihos Trojan. As explained earlier, we extracted N-Grams of different sizes (2 bytes, 3 bytes, and 4 bytes) to measure which N-Gram size achieves the best detection rate. A separate experiment was conducted for each N-Gram size. The N-Grams are sorted in decreasing order based on the IG score, and duplicate N-Grams are removed, if found. The class-wise document frequency of each N-Gram was determined to prepare the contingency table. The IG method is used to calculate a score for each N-Gram, and the top K N-Grams are determined based on the highest IG scores. Experiments were conducted for K = 200, 400, and 600. Further, the best features were drawn at each K value for the different N-Gram lengths. The best features were pre-processed by the instruction converter to prepare ARFF files for the selected N-Grams. The ARFF files were submitted to the WEKA tool for classification. A wide set of experiments was conducted to determine which classifier achieved the best malware detection rate with a low False Positive Rate (FPR). We evaluated the performance of several classification algorithms available in the WEKA tool.

¹ https://github.com/rieck/malheur/tree/master/data

Our objective was to find the best classification algorithm among those available in the WEKA tool. From that perspective, we selected six classifiers from among the eight categories in the WEKA tool. The six classifiers chosen were Bayesian-Logistic-Regression, SPegasos, IB1, Bagging, Part, and J48, which fall under the Bayes, functions, lazy, meta, rules, and trees categories of WEKA, respectively. For evaluation purposes, we measured and tabulated the True Positive Rate (TPR), False Positive Rate (FPR), Precision, Recall, F-measure, ROC Area, and Accuracy for all six classifiers, as shown in TABLE I and TABLE II.

Two experiments were carried out. In the first experiment, we considered N-Grams of three bytes and selected the top 200, 400, and 600 N-Grams based on the highest IG scores. As shown in Fig. 6, the highest accuracy was 89.77% for 200 N-Grams, 90.03% for 400 N-Grams, and 89.88% for 600 N-Grams, yielded by the SPegasos classifier (Fig. 6a). The highest TPR of 0.898 for 200 N-Grams, 0.9 for 400 N-Grams, and 0.899 for 600 N-Grams was produced by the SPegasos classifier (Fig. 6b). The lowest FPR of 0.102 for 200 N-Grams, 0.1 for 400 N-Grams, and 0.101 for 600 N-Grams was given by the SPegasos classifier (Fig. 6c). Receiver Operating Characteristic (ROC) curves are mainly used to compare the classification capability of different algorithms. Among the classifiers tested in this work, the SPegasos classifier attained the best results.

Similarly, in the second experiment, N-Grams of length four bytes were analyzed.
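
Returning to the feature selection step of Section IV, the Information Gain score of Eq. (1) and the top-K selection can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the joint probability is taken as the fraction of all training files that belong to a class and show the given presence value, the logarithm is base 2 (the paper does not state the base, and the ranking does not depend on it), and the function names are hypothetical.

```python
import math


def information_gain(df_benign, df_malware, n_benign, n_malware):
    """Information Gain of one N-Gram from its class-wise document frequencies."""
    total = n_benign + n_malware
    # Contingency table: number of files per (presence value, class) pair.
    counts = {
        (1, "benign"): df_benign,
        (0, "benign"): n_benign - df_benign,
        (1, "malware"): df_malware,
        (0, "malware"): n_malware - df_malware,
    }
    p_class = {"benign": n_benign / total, "malware": n_malware / total}
    p_value = {
        1: (df_benign + df_malware) / total,
        0: (total - df_benign - df_malware) / total,
    }
    score = 0.0
    for (value, label), count in counts.items():
        if count == 0:
            continue  # a zero joint probability contributes nothing to the sum
        p_joint = count / total
        score += p_joint * math.log2(p_joint / (p_value[value] * p_class[label]))
    return score


def top_k(doc_freqs, n_benign, n_malware, k):
    """Rank N-Grams by IG in non-increasing order and keep the first k."""
    scores = {
        ngram: information_gain(b, m, n_benign, n_malware)
        for ngram, (b, m) in doc_freqs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical document frequencies: N-Gram -> (benign files, malware files).
doc_freqs = {"a": (0, 10), "b": (5, 5), "c": (2, 9)}
print(top_k(doc_freqs, n_benign=10, n_malware=10, k=2))
```

An N-Gram that occurs in every malware file and in no benign file gets the maximum score for two equally sized classes, while one that occurs equally often in both classes scores zero, which is why the ranking favors class-specific call patterns.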


TABLE I: WEKA classification results for N-Gram length 3 bytes.

N-Gram Length = 3  | Selected Top N-Grams = 200          | Selected Top N-Grams = 400          | Selected Top N-Grams = 600
Measure      Class | C1    C2    C3    C4    C5    C6    | C1    C2    C3    C4    C5    C6    | C1    C2    C3    C4    C5    C6
TPR          B     | 0.894 0.902 0.881 0.912 0.899 0.896 | 0.882 0.894 0.874 0.903 0.877 0.886 | 0.882 0.904 0.874 0.908 0.888 0.874
TPR          M     | 0.895 0.887 0.885 0.88  0.886 0.899 | 0.906 0.899 0.882 0.885 0.9   0.915 | 0.91  0.89  0.882 0.885 0.881 0.923
TPR          W     | 0.894 0.894 0.883 0.896 0.893 0.898 | 0.894 0.896 0.878 0.894 0.889 0.9   | 0.896 0.897 0.878 0.897 0.885 0.899
FPR          B     | 0.105 0.113 0.115 0.12  0.114 0.101 | 0.094 0.101 0.118 0.115 0.1   0.085 | 0.09  0.11  0.118 0.115 0.119 0.077
FPR          M     | 0.106 0.098 0.119 0.088 0.101 0.104 | 0.118 0.106 0.126 0.097 0.123 0.114 | 0.118 0.096 0.126 0.092 0.112 0.126
FPR          W     | 0.106 0.106 0.117 0.104 0.108 0.102 | 0.106 0.104 0.122 0.106 0.112 0.1   | 0.104 0.103 0.122 0.103 0.115 0.101
Precision    B     | 0.895 0.888 0.885 0.884 0.888 0.899 | 0.903 0.898 0.881 0.887 0.897 0.912 | 0.908 0.891 0.881 0.888 0.882 0.919
Precision    M     | 0.894 0.9   0.881 0.909 0.897 0.897 | 0.885 0.894 0.875 0.901 0.88  0.889 | 0.885 0.903 0.875 0.906 0.887 0.88
Precision    W     | 0.894 0.894 0.883 0.897 0.893 0.898 | 0.894 0.896 0.878 0.894 0.889 0.901 | 0.896 0.897 0.878 0.897 0.885 0.9
Recall       B     | 0.894 0.902 0.881 0.912 0.899 0.896 | 0.882 0.894 0.874 0.903 0.877 0.886 | 0.882 0.904 0.874 0.908 0.888 0.874
Recall       M     | 0.895 0.887 0.885 0.88  0.886 0.899 | 0.906 0.899 0.882 0.885 0.9   0.915 | 0.91  0.89  0.882 0.885 0.881 0.923
Recall       W     | 0.894 0.894 0.883 0.896 0.893 0.898 | 0.894 0.896 0.878 0.894 0.889 0.9   | 0.896 0.897 0.878 0.897 0.885 0.899
F-measure    B     | 0.894 0.895 0.883 0.898 0.893 0.898 | 0.893 0.896 0.878 0.895 0.887 0.899 | 0.894 0.898 0.877 0.898 0.885 0.896
F-measure    M     | 0.894 0.893 0.883 0.894 0.892 0.898 | 0.895 0.897 0.879 0.893 0.89  0.902 | 0.897 0.896 0.878 0.895 0.884 0.901
F-measure    W     | 0.894 0.894 0.883 0.896 0.892 0.898 | 0.894 0.896 0.878 0.894 0.888 0.9   | 0.896 0.897 0.878 0.897 0.885 0.899
ROC Area     B     | 0.968 0.971 0.883 0.896 0.965 0.898 | 0.968 0.972 0.878 0.894 0.959 0.9   | 0.966 0.972 0.878 0.897 0.955 0.899
ROC Area     M     | 0.968 0.971 0.883 0.896 0.965 0.898 | 0.968 0.972 0.878 0.894 0.959 0.9   | 0.966 0.972 0.878 0.897 0.955 0.899
ROC Area     W     | 0.968 0.971 0.883 0.896 0.965 0.898 | 0.968 0.972 0.878 0.894 0.959 0.9   | 0.966 0.972 0.878 0.897 0.955 0.899
Accuracy (%)       | 89.43 89.42 88.30 89.62 89.25 89.77 | 89.40 89.63 87.82 89.40 88.85 90.03 | 89.60 89.68 87.78 89.67 88.47 89.88

TPR: True Positive Rate, FPR: False Positive Rate, C1: J48, C2: Bagging, C3: IB1, C4: Bayesian Logistic Regression, C5: Part, C6: SPegasos, B: Benign, M: Malware, W: Weighted Average
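
The per-class and weighted measures reported in TABLE I can be derived from a two-class confusion matrix. The Python sketch below shows the standard definitions; it assumes the weighted average (W) weights each class by its number of instances, which is not spelled out in the paper, and the confusion-matrix counts are hypothetical rather than taken from the experiments.

```python
def class_metrics(tp, fn, fp, tn):
    """Per-class measures, treating the given class as the positive class."""
    tpr = tp / (tp + fn)  # True Positive Rate, identical to Recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "Precision": precision, "F-measure": f_measure}


def evaluate(benign_as_benign, benign_as_malware, malware_as_malware, malware_as_benign):
    """Benign (B), malware (M), and weighted (W) measures plus accuracy in percent."""
    n_benign = benign_as_benign + benign_as_malware
    n_malware = malware_as_malware + malware_as_benign
    total = n_benign + n_malware
    b = class_metrics(benign_as_benign, benign_as_malware,
                      malware_as_benign, malware_as_malware)
    m = class_metrics(malware_as_malware, malware_as_benign,
                      benign_as_malware, benign_as_benign)
    # Weighted average: each class contributes in proportion to its size.
    w = {name: (n_benign * b[name] + n_malware * m[name]) / total for name in b}
    accuracy = 100.0 * (benign_as_benign + malware_as_malware) / total
    return {"B": b, "M": m, "W": w, "Accuracy (%)": accuracy}


# Hypothetical counts: 100 benign files (90 correct), 100 malware files (80 correct).
result = evaluate(benign_as_benign=90, benign_as_malware=10,
                  malware_as_malware=80, malware_as_benign=20)
print(result["W"], result["Accuracy (%)"])
```

With only two classes, the FPR of one class equals one minus the TPR of the other, which matches TABLE I: for C6 at 200 N-Grams, the malware TPR of 0.899 pairs with the benign FPR of 0.101.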

TABLE II: WEKA classification results for N-Gram length 4 bytes.

N-Gram Length = 4  | Selected Top N-Grams = 200          | Selected Top N-Grams = 400          | Selected Top N-Grams = 600
Measure      Class | C1    C2    C3    C4    C5    C6    | C1    C2    C3    C4    C5    C6    | C1    C2    C3    C4    C5    C6
TPR          B     | 0.899 0.902 0.88  0.9   0.899 0.921 | 0.879 0.904 0.881 0.904 0.885 0.9   | 0.881 0.907 0.881 0.903 0.88  0.894
TPR          M     | 0.885 0.885 0.878 0.878 0.887 0.88  | 0.907 0.887 0.873 0.878 0.9   0.891 | 0.904 0.887 0.882 0.877 0.887 0.905
TPR          W     | 0.892 0.894 0.879 0.889 0.893 0.9   | 0.893 0.896 0.877 0.891 0.893 0.896 | 0.893 0.897 0.882 0.89  0.884 0.9
FPR          B     | 0.115 0.115 0.122 0.122 0.113 0.12  | 0.093 0.113 0.127 0.122 0.1   0.109 | 0.096 0.113 0.118 0.123 0.113 0.095
FPR          M     | 0.101 0.098 0.12  0.1   0.101 0.079 | 0.121 0.096 0.119 0.096 0.115 0.1   | 0.119 0.093 0.119 0.097 0.12  0.106
FPR          W     | 0.108 0.106 0.121 0.111 0.107 0.1   | 0.107 0.104 0.123 0.109 0.108 0.104 | 0.108 0.103 0.118 0.11  0.117 0.101
Precision    B     | 0.886 0.884 0.879 0.881 0.889 0.887 | 0.904 0.889 0.874 0.881 0.898 0.892 | 0.901 0.889 0.882 0.88  0.886 0.904
Precision    M     | 0.898 0.918 0.88  0.898 0.898 0.9   | 0.882 0.902 0.88  0.902 0.887 0.899 | 0.884 0.905 0.881 0.9   0.881 0.895
Precision    W     | 0.892 0.901 0.879 0.889 0.893 0.894 | 0.893 0.896 0.877 0.891 0.893 0.896 | 0.893 0.897 0.882 0.89  0.884 0.9
Recall       B     | 0.899 0.921 0.88  0.9   0.899 0.902 | 0.879 0.904 0.881 0.904 0.885 0.9   | 0.881 0.907 0.881 0.903 0.88  0.894
Recall       M     | 0.885 0.88  0.878 0.878 0.887 0.885 | 0.907 0.887 0.873 0.878 0.9   0.891 | 0.904 0.887 0.882 0.877 0.887 0.905
Recall       W     | 0.892 0.9   0.879 0.889 0.893 0.894 | 0.893 0.896 0.877 0.891 0.893 0.896 | 0.893 0.897 0.882 0.89  0.884 0.9
F-measure    B     | 0.893 0.902 0.879 0.89  0.894 0.895 | 0.891 0.897 0.877 0.892 0.892 0.896 | 0.891 0.898 0.882 0.891 0.883 0.899
F-measure    M     | 0.891 0.898 0.879 0.888 0.892 0.893 | 0.894 0.895 0.877 0.89  0.893 0.895 | 0.894 0.896 0.882 0.888 0.884 0.9
F-measure    W     | 0.892 0.9   0.879 0.889 0.893 0.894 | 0.893 0.896 0.877 0.891 0.892 0.896 | 0.892 0.897 0.882 0.89  0.883 0.899
ROC Area     B     | 0.964 0.97  0.879 0.889 0.964 0.894 | 0.964 0.972 0.877 0.891 0.963 0.896 | 0.965 0.972 0.882 0.89  0.956 0.9
ROC Area     M     | 0.964 0.97  0.879 0.889 0.964 0.894 | 0.964 0.972 0.877 0.891 0.963 0.896 | 0.965 0.972 0.882 0.89  0.956 0.9
ROC Area     W     | 0.964 0.97  0.879 0.889 0.964 0.894 | 0.964 0.972 0.877 0.891 0.963 0.896 | 0.965 0.972 0.882 0.89  0.956 0.9
Accuracy (%)       | 89.20 89.37 87.93 88.92 89.30 90.03 | 89.27 89.57 87.7  89.1  89.25 89.57 | 89.20 89.68 88.17 88.97 88.35 89.95

TPR: True Positive Rate, FPR: False Positive Rate, C1: J48, C2: Bagging, C3: IB1, C4: Bayesian Logistic Regression, C5: Part, C6: SPegasos, B: Benign, M: Malware, W: Weighted Average

For N-Grams of length four bytes, the highest accuracy was 90.03% for 200 N-Grams, 89.57% for 400 N-Grams, and 89.95% for 600 N-Grams, obtained with the SPegasos classifier (Fig. 7a). The highest TPR was 0.9 for 200 N-Grams, 0.896 for 400 N-Grams, and 0.9 for 600 N-Grams, obtained by the SPegasos classifier (Fig. 7b). The lowest FPR was 0.1 for 200 N-Grams, 0.104 for 400 N-Grams, and 0.101 for 600 N-Grams, produced by the SPegasos classifier (Fig. 7c). From visual inspection of Fig. 6 and Fig. 7, we conclude that the SPegasos classifier performed best and gave better classification for both N-Gram lengths, three and four.
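
The instruction converter of Section IV-C, which prepares the ARFF files submitted to WEKA, can be illustrated with the following Python sketch. It is an assumption-laden illustration rather than the authors' converter: each selected N-Gram becomes a binary presence attribute (the paper does not say whether presence or counts are used), attribute names such as ngram_1 and the relation name are hypothetical, and the sample data are invented.

```python
def to_arff(relation, ngrams, instances):
    """Render binary N-Gram presence features as ARFF text.

    instances: iterable of (set_of_ngrams_in_file, class_label) pairs.
    """
    lines = ["@relation " + relation, ""]
    for index, ngram in enumerate(ngrams, start=1):
        # The N-Gram itself is kept as a comment; attribute names stay simple.
        lines.append("% ngram_{} = {}".format(index, " ".join(ngram)))
        lines.append("@attribute ngram_{} {{0,1}}".format(index))
    lines.append("@attribute class {benign,malware}")
    lines.append("")
    lines.append("@data")
    for present, label in instances:
        values = ["1" if ngram in present else "0" for ngram in ngrams]
        lines.append(",".join(values + [label]))
    return "\n".join(lines) + "\n"


# Hypothetical selected N-Grams and two example files.
selected = [("open_file", "read_file"), ("read_file", "close_file")]
instances = [
    ({("open_file", "read_file"), ("read_file", "close_file")}, "benign"),
    ({("open_file", "read_file")}, "malware"),
]

arff_text = to_arff("malware_ngrams", selected, instances)
print(arff_text)
```

Each data row lists the attribute values in declaration order followed by the class label, so the number of columns equals the number of selected N-Grams plus one.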


Fig. 6: Graphical representation of the evaluation measures (a) Accuracy, (b) True Positive Rate, (c) False Positive Rate, and (d) ROC Area when the N-Gram length is three bytes. Each panel plots the measure against the number of selected top N-Grams.
Fig. 7: Graphical representation of the evaluation measures (a) Accuracy, (b) True Positive Rate, (c) False Positive Rate, and (d) ROC Area when the N-Gram length is four bytes. Each panel plots the measure against the number of selected top N-Grams.
VI. CONCLUSION

To detect the malicious activities of malware, behavior analysis of the executable file (process), in the form of the system calls invoked by the input file during execution, has been employed. The gathered system call sequence is chunked into N-Grams, and each N-Gram is treated as a feature. The IG feature selection method was used to choose the best features based on the highest IG scores, and the selected features were used to prepare the FFV needed by the classifier. The experiments were performed using different classifiers available in the WEKA tool. The experimental observations show that SPegasos is the best of the six chosen classifiers, since it achieved the highest accuracy, the highest TPR, and the lowest FPR compared to the others. SPegasos achieved a better detection rate for feature set sizes of 200, 400, and 600. Our future work will aim to develop a multiprocessing model able to compute IG scores for larger N-Gram datasets.

REFERENCES

[1] A. Shabtai, R. Moskovitch, Y. Elovici, and C. Glezer, “Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey,” Information Security Technical Report, vol. 14, no. 1, pp. 16–29, 2009.
[2] Anubis: Analyzing Unknown Binaries, http://analysis.iseclab.org/.
[3] Cuckoo Sandbox, https://cuckoosandbox.org/.
[4] K. Rieck, P. Trinius, C. Willems, and T. Holz, “Automatic analysis of malware behavior using machine learning,” Journal of Computer Security, vol. 19, no. 4, pp. 639–668, 2011.
[5] C. Willems, T. Holz, and F. Freiling, “Toward automated dynamic malware analysis using CWSandbox,” IEEE Security and Privacy, vol. 5, no. 2, pp. 32–39, 2007.
[6] T. K. Lengyel, S. Maresca, B. D. Payne, G. D. Webster, S. Vogl, and A. Kiayias, “Scalability, fidelity and stealth in the DRAKVUF dynamic malware analysis system,” in Proceedings of the 30th Annual Computer Security Applications Conference. ACM, 2014, pp. 386–395.
[7] M. Neugschwandtner, C. Platzer, P. M. Comparetti, and U. Bayer, “dAnubis–dynamic device driver analysis based on virtual machine introspection,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2010, pp. 41–60.
[8] Y. Qiao, Y. Yang, J. He, C. Tang, and Z. Liu, “CBM: free, automatic malware analysis framework using API call sequences,” in Knowledge Engineering and Management. Springer, 2014, pp. 225–236.
[9] J. Shi, Y. Yang, C. Li, and X. Wang, “SPEMS: A stealthy and practical execution monitoring system based on VMI,” in International Conference on Cloud Computing and Security. Springer, 2015, pp. 380–389.
[10] D. K. S. Reddy and A. K. Pujari, “N-gram analysis for computer virus detection,” Journal in Computer Virology, vol. 2, no. 3, pp. 231–239, 2006.
[11] S. Jain and Y. K. Meena, “Byte level n-gram analysis for malware detection,” in Computer Networks and Intelligent Computing. Springer, 2011, pp. 51–59.
[12] H. Parvin, B. Minaei, H. Karshenas, and A. Beigi, “A new n-gram feature extraction-selection method for malicious code,” in International Conference on Adaptive and Natural Computing Algorithms. Springer, 2011, pp. 98–107.
[13] G. J. Tesauro, J. O. Kephart, and G. B. Sorkin, “Neural networks for computer virus recognition,” IEEE Expert, vol. 11, no. 4, pp. 5–6, 1996.
[14] J. Z. Kolter and M. A. Maloof, “Learning to detect and classify malicious executables in the wild,” Journal of Machine Learning Research, vol. 7, no. Dec, pp. 2721–2744, 2006.
[15] K. Rieck, T. Holz, C. Willems, P. Düssel, and P. Laskov, “Learning and classification of malware behavior,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2008, pp. 108–125.

Authorized licensed use limited to: Penn State University. Downloaded on November 15,2020 at 02:14:45 UTC from IEEE Xplore. Restrictions apply.


- - - -

  

   

   

(a) Accuracy. (b) TPR. (c) FPR. (d) ROC Area.

  

   

   

(a) Accuracy. (b) TPR. (c) FPR. (d) ROC Area.

You might also like

- - - -