Internet 2016 1 40 40038

Uploaded by

Aditya

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views6 pages

Internet 2016 1 40 40038

Uploaded by

Aditya

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 6

INTERNET 2016 : The Eighth International Conference on Evolving Internet

Static Detection of Malware and Benign Executable

Using Machine Learning Algorithm

Dong-Hee Kim∗ , Sang-Uk Woo∗ , Dong-Kyu Lee∗ and Tai-Myoung Chung†

∗ Deptof Electrical and Computer Engineering
Sungkyunkwan University, Suwon, Korea
Email: {kkim, suwoo, leedg84}@imtl.skku.ac.kr
† College of Software

Sungkyunkwan University, Suwon, Korea

Email: [email protected]

Abstract—One of the popular ways of detecting malware is addition, machine learning algorithm can detect a large amount
signature based pattern matching. However, the signature of of malware using relatively small amount of input training sets.
malware should be stored in advance for the pattern matching
detection. Moreover, it calculates the similarity of input data using The interested detection method is PE-miner framework
stored signature. Therefore, the storage problem and calcula- [10]. The PE format is a file format for executables, object
tion overheads occur undoubtedly. Also, detection possibility is code, DLLs, Font files, and others used in 32-bit and 64-bit
dropped, when malicious code is modified. So we use machine versions of Windows operating systems [11]. In shafiq et al.
learning algorithm technique for detecting malicious executable paper [10], they have analyzed the distinctive characteristics
and benign executable. However, previous technique has a limita- of PE-header between malicious executable and benign one.
tion on detecting Worms and Trojans. In this paper, distinguished They categorized malicious executable into 7 types; backdoor
features of Portable Executable header are used. For the machine + sniffer, Constructor + Virtool, DoS + Nuker, Flooder, Exploit
learning algorithm, Classification And Regression Tree (CART), + Hacktool, Work, Trojan and Virus. From the PE-header,
Support Vector Classification (SVC), and Stochastic Gradient
18 different features are founded by Shafiq. However, PE-
Descent (SGD) are applied for improving to detection rate.
The performance of each algorithm firstly evaluated to find header features might not convey useful information in a
the most outperformed algorithm each for classifying benign particular scenario. For example, some attribute value could
executable and malicious executable. And then, these algorithms have too much low value or dummy value, and some could
were combined to detect malware more precisely. be counter. Also, considering the application of the many
Keywords–Portable Executable Header; Machine Learning;
attributes increases the dimensional spaces in machine learning
Malware Detection; Intrusion Detection System. algorithm. This is the main reason for time delay in fitting pro-
cess. So, for reducing dimensionality of input feature space, a
preprocessor process is removing or combining the PE-header
I. I NTRODUCTION information with other similar features. Redundant Feature
Traditionally signature-based static method is mostly used Removal (RFR), Principal Component Analysis (PCA), and
for malware detection. Signature-based method has some Haar Wavelet Transform (HWT) mechanisms are used for
drawbacks. Pattern matching method, one of the signature- preprocessing the PE-header feature.
based static method, should possess all the pattern information The purpose of this paper is to evaluate the existing PE-
of malware samples before the detection. Saving all the pattern miner framework [10] and improving the detection rate by
informations, may causes the storage management problem and adjusting the attribute of training set and algorithm. We have
matching overheads. Moreover, detection efficiency of pattern chosen the PE feature from many other distinctive character-
matching method decreases, if pattern is changed by source istics because it has an almost fixed size of data structure
code modification (e.g., inserting or removing the opcode). regardless of program size. If the number of attributes com-
Therefore, machine learning-based malware detection meth- posing the training set is changed depending on data, it will
ods are being researched [1][2][3][4]. The purpose of using increase the complexity of training process. We expect that the
machine learning algorithm is to study the pattern from the attributes that extracted from previous research could not carry
learning set and to predict the classes or value from the given the characteristic of the malware according to the Windows
data [5]. The acceptable detection rate is described in several system changes. Also, in previous research [10], Shafiq et al.
previous researches. The various features of benign code and use insufficient amount of training set and sample file. For their
malicious code had been considered from many research paper. experiment, 1,477 benign sample files and 15,925 malware
Researchers have derived the distinctive characteristics which sample files were used. The most relevant information is stored
are from binary code [6][7], opcode [8][9], and Portable with the highest coefficients at each order of a transform.
Executable header (PE-header) of benign executables and The lower order coefficients can be ignored to get only the
malicious executables [10]. They have evaluated their result most relevant information. Decision Tree (J48), Instance Based
using a variety of machine learning algorithms. The advantage Learner (IBk), Native Bayes (NB), RIPPER (inductive rule
of using a machine learning technique is the prediction of learner), Support Vector Machine using Sequential Minimal
unknown class. It can detect not only known malware but also Optimization (SMO) algorithms are used for their experiment.
non-recognized malware through the pattern analysis itself. In The outputs of these algorithms were compared with each other