
BIGDATA IEEE EXPLORE Binary malware image classification using machine learning with Local binary pattern

This paper presents a novel methodology for malware classification using machine learning and Local Binary Pattern (LBP) features extracted from binary malware images. The proposed approach reorganizes malware images into 3 by 3 grids, applies LBP for feature extraction, and utilizes TensorFlow for classification, achieving higher accuracy compared to traditional methods. Experimental results indicate that the TensorFlow + LBP combination yields an accuracy of 93.17%, outperforming other classifiers such as SVM and KNN.


2017 IEEE International Conference on Big Data (BIGDATA)

Binary Malware Image Classification using Machine Learning with Local Binary Pattern

Jhu-Sin Luo
Department of Computer Science, Kennesaw State University
jluo6@students.kennesaw.edu

Dan Chia-Tien Lo
Department of Computer Science, Kennesaw State University
dlo2@kennesaw.edu

Abstract—Malware classification is a critical part of cybersecurity. Traditional malware classification methodologies typically use static analysis and dynamic analysis to identify malware. In this paper, we propose a malware classification methodology based on the binary image of the malware and local binary pattern (LBP) features extracted from it. First, malware images are reorganized into 3 by 3 grids, a layout chosen mainly to support LBP feature extraction. Second, LBP is applied to the malware images to extract features, because it is useful in pattern and texture classification. Finally, TensorFlow, a library for machine learning, is applied to classify the malware images using the LBP features. Performance comparisons among different classifiers and different image descriptors, namely GIST (a spatial envelope descriptor) and LBP, demonstrate that our proposed approach outperforms the others.

Keywords—malware, classification, machine learning, visualization, local binary pattern.

I. INTRODUCTION

Over the past few years, Internet usage has experienced exponential growth, and the Internet has become an important part of our daily lives. Cybersecurity plays a growing role as online financial activities such as online payment and online money transactions become widespread [1]. Internet users face threats from malware, which harms both computer users and Internet users. AV-TEST, an IT security institute, registered over 583 million malware samples in 2017 [2], and according to its reports the amount of malware increases dramatically every year.

Traditional methodologies for malware classification or detection mainly use static analysis and dynamic analysis to identify the type and behavior of the malware. Both methodologies have advantages and disadvantages. Static analysis examines the executable file without actually executing it. It extracts the binary code from the file to generate patterns or features that can be used to decide whether the file is malware. Static analysis is ineffective against various forms of code obfuscation [3]. Dynamic analysis, on the other hand, examines the file by executing it in a secure or virtual environment, so that the behavior of the malware can be observed. Nonetheless, dynamic analysis also has disadvantages: malware might behave differently in two different environments, and some behaviors may need to be triggered under specific circumstances.

In this paper, a malware classification approach based on image processing and a convolutional neural network is proposed. First, as Figure 1 shows, the pixels of the malware images are reorganized. The pixels of the original malware images are laid out line by line; we rearrange them into 3 by 3 grids. Second, LBP is applied to the malware images to extract features. Finally, the malware images are classified with TensorFlow, and the results are compared with those of other classifiers.

II. RELATED WORK

In [1], L. Nataraj, S. Karthikeyan, G. Jacob and B. S. Manjunath visualize malware as grey scale images. They apply the GIST descriptor, which is useful for scene classification, to the malware images and use a k-nearest neighbor classifier to classify them. In [5], Aziz Makandar and Anita Patrot apply the Discrete Wavelet Transform (DWT) to the malware images to extract features and use a Support Vector Machine (SVM) to discriminate between the malware classes. In [14], Aziz Makandar and Anita Patrot obtain global features of the malware images using the Gabor wavelet transform and GIST, and use an artificial neural network (ANN) to train on and test the malware images.

III. OUR METHODOLOGY

In this section, we present our approach step by step. First, we describe malware visualization and reorganization. Second, we introduce the LBP and show how to apply it to our images. Third, we show the TensorFlow architecture that we use for training and classification.

A. Malware Visualization

In [1], L. Nataraj, S. Karthikeyan, G. Jacob and B. S. Manjunath visualize malware as grey scale images with values in the range [0, 255]. The width of the image is fixed and the height is allowed to vary. In [14], Aziz Makandar and Anita Patrot also convert malware into grey scale images in the range [0, 255]. In [5], the malware is likewise visualized as a grey scale image and normalized to 256*256 pixels.

Our methodology reorganizes the grey scale malware images provided by L. Nataraj, S. Karthikeyan, G. Jacob and B. S. Manjunath [1, 4]. They obtain these images by reading each malware file in binary: a malware binary is read as a vector of 8-bit unsigned integers and then arranged into a 2D array (Figure 2). We rearrange the pixels of each malware image into 3 by 3 grids (Figure 3), because this layout is suitable for extracting the LBP descriptor.
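The 3 by 3 reorganization can be sketched in a few lines of Python. This is only an illustration of the layout drawn in Figure 3, not the authors' implementation: the function name `reorganize_3x3`, the `blocks_per_row` parameter, and the zero padding of an incomplete final band are all assumptions, since the paper does not state how wide the reorganized image is or how leftover pixels are handled.

```python
def reorganize_3x3(pixels, blocks_per_row):
    """Lay out a line-by-line pixel stream as side-by-side 3x3 blocks.

    Every run of 9 consecutive pixels fills one 3x3 block (row by row),
    and blocks are placed left to right, as drawn in Fig. 3.
    `blocks_per_row` is a hypothetical parameter: the paper does not
    say how many blocks make up one band of the output image.
    """
    block_size = 9
    band_size = block_size * blocks_per_row
    # Assumption: pad with zeros so the stream fills whole bands of blocks.
    padded = list(pixels) + [0] * (-len(pixels) % band_size)

    rows = []
    for band_start in range(0, len(padded), band_size):
        band = [[], [], []]  # the three image rows covered by this band
        for b in range(blocks_per_row):
            start = band_start + b * block_size
            for r in range(3):
                band[r].extend(padded[start + 3 * r:start + 3 * r + 3])
        rows.extend(band)
    return rows


# Pixels numbered 1..18, matching the "1st pixel ... 18th pixel" labels in Fig. 3.
demo = reorganize_3x3(range(1, 19), blocks_per_row=2)
for row in demo:
    print(row)
# [1, 2, 3, 10, 11, 12]
# [4, 5, 6, 13, 14, 15]
# [7, 8, 9, 16, 17, 18]
```

With this layout, the nine pixels that were adjacent in the original byte stream end up inside one 3 by 3 neighborhood, which is the neighborhood shape the LBP operator works on.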

978-1-5386-2715-0/17/$31.00 ©2017 IEEE 4664


[Figure: Reorganize by 3 by 3 grid → Extract features by using LBP → Classify by using TensorFlow]
Fig. 1: Overview of entire System

[Figure: Malware binary (0100101101101…) → Binary to 8 bit vector → 8 bit vector convert to grey scales image]
Fig. 2: Reorganized malware image

[Figure: Malware image → Read each pixel line by line {{first pixel},{second pixel}…..} → reorganized each 9 pixels 3 by 3 grid]

1st pixel   2nd pixel   3rd pixel   |   10th pixel   11th pixel   12th pixel
4th pixel   5th pixel   6th pixel   |   13th pixel   14th pixel   15th pixel
7th pixel   8th pixel   9th pixel   |   16th pixel   17th pixel   18th pixel

Fig. 3: Reorganized malware image

B. Local Binary Pattern

The Local Binary Pattern, a visual descriptor, is useful for texture analysis and texture classification [6, 7, 8]. As Figure 4 shows, the value of the central pixel is used as the threshold. The 8 neighbors around a pixel are compared with the central pixel. If a neighbor's value is greater than the central pixel, the neighbor is written as '1'; a neighbor whose value is less than the threshold is written as '0'. The thresholded results are multiplied by weights given by powers of two, and the central value becomes the sum of these products. The same process is applied to every pixel in the image. The final LBP descriptor is obtained by computing the histogram of the resulting image.

[Figure: a 3 by 3 neighborhood is thresholded against its central value, and the result is multiplied by the weights]

Neighborhood     Threshold     Weights          Multiply
25  10  29       0  0  1         1    2    4      0    0    4
 7  27  30       0     1       128         8      0  108    8
56  41  13       1  1  0        64   32   16     64   32    0

LBP = 4+8+32+64 = 108

Fig. 4: The LBP operator

C. TensorFlow Architecture

We use TensorFlow [13, 15] for training and testing. As Figure 5 shows, we use a 3x3 convolutional layer with ReLU, followed by a 2x2 max pooling layer with stride 2 to downsample. The first convolutional layer has 16 filters. The second convolutional layer also uses 3x3 filters, but has 32 of them. The output of max pooling is multi-dimensional, so a flattening layer converts it into a one-dimensional vector, which is then passed to the fully connected layer.

[Figure: Input image → 3x3 Convolutional + ReLU layer (Stride 1) → 2x2 Max pooling (Stride 2) → 3x3 Convolutional + ReLU layer (Stride 1) → 2x2 Max pooling (Stride 2) → Flattened layer → Fully connected layer → Output]
Fig. 5: Architecture of TensorFlow

D. Dataset

The dataset we use is provided by [1, 4]. It includes 32 families and around 12000 grey scale malware images (Table I). The malware mainly belongs to the trojan, password stealer and virus types. We use 20% of each malware family for training and the rest for testing.

Malware Family           Type of malware     Amount of malware
Adialer.C.UPX            Adialer             188
Agent.FYI                Backdoor            116
Aliser.7825              Trojan              256
Allaple.A                Worm                4540
Alueron_Gen_J            Trojan              198
Autorun.A                Worm                106
Azero.A                  Trojan              121
Backdoor.Agent.AsPack    Backdoor            180
C2Lop                    Trojan              692
Dialplatform.B           Dialer              177
Dontovo.A                TrojanDownloader    162
Fakerean                 Rogue               381
Farfli.I                 Backdoor            94
Instantaccess            Dialer              431
Lolyda.AA1               PasswordSteeler     213
Lolyda.AA2               PasswordSteeler     184
Lolyda.AA3               PasswordSteeler     123
Lolyda.AT                PasswordSteeler     159
Luder.B                  Virus               509
Malex.gen!J              Trojan              136
Nuwar.A                  Virus               51
Obfuscator.AD            TrojanDownloader    142
Rbot.gen                 Backdoor            158
Sality.AM                Virus               127
Skintrim.N               Trojan              80
Swizzor.gen              TrojanDownloader    520
VB.AT                    Worm                408
Virut.A                  Virus               133
Virut.AC                 Virus               269
Virut.AK                 Virus               571
Wintrim.BX               TrojanDownloader    97
Yuner.A                  Worm                800

TABLE I: Malware Family

IV. EXPERIMENTAL RESULTS

We evaluate TensorFlow for classification of the LBP features, and we also use the LBP features to train a Support Vector Machine (SVM) classifier and a k-nearest neighbor (KNN) classifier. In addition, we evaluate GIST [9, 10, 11] features with TensorFlow, KNN and SVM. Tables II, III and IV are the confusion matrices of TensorFlow, KNN and SVM using the LBP feature. According to the confusion matrices, malware belonging to families 28, 29 and 30, which are Virut.A, Virut.AC and Virut.AT respectively, is easily confused. As seen in Table II, TensorFlow differentiates these three families with higher accuracy than the other classifiers. Table V lists the accuracy of the different methodologies over the 32 malware families.

TABLE II: Confusion Matrix of Tensorflow using LBP feature

TABLE III: Confusion Matrix of KNN using LBP feature

TABLE IV: Confusion Matrix of SVM using LBP feature

V. PROS AND CONS

Our approach runs on a GPU, which significantly shortens the execution time (Figure 6). Moreover, this method does not have to run on a virtual machine or in a virtual environment to observe the behavior of malware. Additionally, because our approach is based on image processing, we can apply other image descriptors and combine them by voting to achieve higher classification accuracy. Although malware images can be analyzed with our approach based on local binary patterns and machine learning, countermeasures still exist. Because our approach converts the malware into a binary image and reorganizes it, the approach may fail if an adversary rewrites the whole program in another way, or replaces the original instructions with different ones, so that the whole pattern of the malware image changes.
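As a concrete check of the LBP operator from Section III-B, the sketch below recomputes the worked example of Fig. 4 in plain Python and then builds the histogram descriptor. It is a minimal illustration under stated assumptions rather than the authors' code: the names `lbp_code` and `lbp_histogram` are hypothetical, a neighbor equal to the center is written as 0 (the text only covers "greater" and "less"; many LBP formulations use greater-or-equal instead), and border pixels without a full neighborhood are skipped.

```python
# Neighbour offsets (row, col) and their power-of-two weights,
# following the weight layout shown in Fig. 4:
#     1    2    4
#   128    .    8
#    64   32   16
OFFSETS_AND_WEIGHTS = [
    ((-1, -1), 1), ((-1, 0), 2), ((-1, 1), 4),
    ((0, 1), 8), ((1, 1), 16), ((1, 0), 32),
    ((1, -1), 64), ((0, -1), 128),
]


def lbp_code(image, r, c):
    """Return the LBP value of the pixel at (r, c).

    Each neighbour strictly greater than the centre contributes its weight.
    Assumption: ties count as 0, since the paper does not specify them.
    """
    center = image[r][c]
    code = 0
    for (dr, dc), weight in OFFSETS_AND_WEIGHTS:
        if image[r + dr][c + dc] > center:
            code += weight
    return code


def lbp_histogram(image):
    """Return the 256-bin histogram of LBP codes over all interior pixels.

    Assumption: border pixels are skipped because they lack 8 neighbours.
    """
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


# The 3x3 neighbourhood from Fig. 4 (centre value 27).
patch = [
    [25, 10, 29],
    [7, 27, 30],
    [56, 41, 13],
]
print(lbp_code(patch, 1, 1))  # 108 = 4 + 8 + 32 + 64
```

For a full malware image, `lbp_histogram(image)` would yield a 256-element feature vector that can be passed to a classifier; whether the authors feed the histogram or the LBP-coded image to TensorFlow is not stated in the text.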

Classification Method    Accuracy
TensorFlow+LBP           93.17%
SVM+LBP                  87.88%
KNN+LBP                  85.93%
TensorFlow+GIST          87.88%
SVM+GIST                 81.23%
KNN+GIST                 82.83%

TABLE V: Experiment Result over 32 Malware Family

[Figure: bar chart of execution time in minutes for KNN, TensorFlow and SVM; data labels 77.6, 3.717 and 2.067]
Fig. 6: Average Execution Time of each Methodology

VI. FUTURE WORK

While our experimental results demonstrate that the accuracy using LBP as the feature is slightly higher than that of other methodologies, there are several ways in which the experiment could be improved. The first priority is to extend the set of malware families, that is, to increase both the size of the dataset and the number of classes. In addition, malware files could be converted into images in different ways, for example into the RGBA color space instead of grey scale, with color LBP [7, 8, 12] as the feature. We also plan to design different TensorFlow architectures and to examine more image descriptors in order to increase accuracy and reduce time consumption.

ACKNOWLEDGMENT

This material is based in part upon work supported by the National Science Foundation under Grant Numbers 1623724, 1438858, 1244697, and 1241651. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Furthermore, we would like to thank the authors of [1] for providing the malware image dataset.

VII. CONCLUSION

Our experimental results show that the accuracy of our approach is 93.17%. The experiment classifies around 12000 malware images from 32 families. We reorganize the malware images, use the Local Binary Pattern as the descriptor to extract features, and classify the results with the TensorFlow library. The comparison over different classifiers and features demonstrates that using LBP with TensorFlow obtains higher accuracy than the other approaches. Furthermore, extending the malware dataset, converting malware to the RGBA color space, designing different TensorFlow architectures and testing more image descriptors are our future work, which may improve the research and yield a more comprehensive methodology.

REFERENCES

[1] L. Nataraj, S. Karthikeyan, G. Jacob and B. S. Manjunath, "Malware Images: Visualization and Automatic Classification," International Symposium on Visualization for Cyber Security (VizSec), July 20, 2011, Pittsburgh, PA, USA.
[2] Malware statistic from: https://www.av-test.org/en/statistics/the malware/
[3] A. Moser, C. Kruegel and E. Kirda, "Limits of Static Analysis for Malware Detection," Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007), Miami Beach, FL, 2007, pp. 421-430.
[4] Malware Images from http://vision.ece.ucsb.edu/~lakshman/malware images/album
[5] Aziz Makandar and Anita Patrot, "Wavelet Statistical Feature Based Malware Class Recognition and Classification using Supervised Learning Classifier," Oriental Journal of Computer Science and Technology, ISSN: 0974-6471, June 2017, Vol. 10, No. (2): Pgs. 400-406.
[6] T. Ojala, M. Pietikainen, and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions," Pattern Recognition, vol. 29, pp. 51-59, 1996.
[7] Chao Zhu, Charles-Edmond Bichot and Liming Chen, "Multi-scale Color Local Binary Patterns for Visual Object Classes Recognition," 2010 20th International Conference on Pattern Recognition, Istanbul, 2010, pp. 3065-3068.
[8] Chao Zhu, Charles-Edmond Bichot and Liming Chen, "Image region description using orthogonal combination of local binary patterns enhanced with color information," Pattern Recognition, Volume 46, Issue 7, 2013, Pages 1949-1963, ISSN 0031-3203.
[9] Aude Oliva and Antonio Torralba, "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope," International Journal of Computer Vision, Vol. 42(3): 145-175, 2001.
[10] A. Oliva and A. Torralba, "Building the gist of a scene: the role of global image features in recognition," Prog. Brain Res. Vis. Percept., vol. 155, pp. 23-36, 2006.
[11] A. Torralba, K. P. Murphy, W. T. Freeman and M. A. Rubin, "Context-Based Vision System for Place and Object Recognition," Proceedings Ninth IEEE International Conference on Computer Vision, Nice, France, 2003, pp. 273-280 vol. 1.
[12] Chao Zhu, Charles-Edmond Bichot and Liming Chen, "Color orthogonal local binary patterns combination for image region description," Rapport technique RR-LIRIS-2011-012, LIRIS UMR, vol. 5205, p. 15, 2011.
[13] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," arXiv preprint arXiv:1603.04467, 2016.
[14] Aziz Makandar and Anita Patrot, "Malware Analysis and Classification using Artificial Neural Network," 2015 International Conference on Trends in Automation, Communications and Computing Technology (I-TACT-15), Bangalore, 2015, pp. 1-6.
[15] R. Pilipović and V. Risojević, "Evaluation of convnets for large-scale scene classification from high-resolution remote sensing images," IEEE EUROCON 2017 - 17th International Conference on Smart Technologies, Ohrid, Macedonia, 2017, pp. 932-937.
