BIGDATA IEEE EXPLORE Binary malware image classification using machine learning with Local binary pattern
Abstract—Malware classification is a critical part of cybersecurity. Traditional methodologies for malware classification typically use static analysis and dynamic analysis to identify malware. In this paper, a malware classification methodology based on the binary image of the malware and extracted local binary pattern (LBP) features is proposed. First, malware images are reorganized into 3 by 3 grids, which are mainly used to extract the LBP feature. Second, the LBP is applied to the malware images to extract features, because it is useful in pattern and texture classification. Finally, TensorFlow, a library for machine learning, is applied to classify malware images with the LBP feature. Performance comparisons among different classifiers with different image descriptors, such as GIST (a spatial envelope descriptor) and the LBP, demonstrate that our proposed approach outperforms the others.

Keywords—malware, classification, machine learning, visualization, local binary pattern.

I. INTRODUCTION

Over the past few years, Internet usage has experienced exponential growth. It has become an important part of our daily lives. Cybersecurity also plays an important role, as online financial activities such as online payment and online money transactions become widespread [1]. Internet users face threats from malware, which harms users of computers and the Internet. AV-TEST, an IT security institute, registered over 583 million malware samples in 2017 [2] and, based on its reports, the amount of malware increases dramatically every year.

Traditional methodologies for malware classification or detection mainly use static analysis and dynamic analysis to identify the type and behavior of the malware. Both methodologies have their advantages and disadvantages. Static analysis examines the executable file without actually executing it. It extracts the binary code from the file to generate patterns or features that can be used to identify whether the file is malware. Static analysis is ineffective against various forms of code obfuscation [3]. Dynamic analysis, on the other hand, verifies the file by executing it in a secure or virtual environment. By executing the file, the behaviors of the malware can be observed. Nonetheless, dynamic analysis also has disadvantages. The malware might behave differently in two different environments, or some behaviors may need to be triggered under specific circumstances.

In this paper, a malware classification approach based on image processing and a convolutional neural network is proposed. First, as Figure 1 demonstrates, each pixel in the malware images is reorganized. The pixels of the original malware images are laid out line by line; we rearrange the pixels of each image into 3 by 3 grids. Second, the LBP is applied to the malware image to extract features. Finally, the malware images are classified with TensorFlow and the result is compared with other classifiers.

II. RELATED WORK

In [1], L. Nataraj, S. Karthikeyan, G. Jacob and B. S. Manjunath visualize malware as grey scale images. They apply the GIST descriptor, which is useful for scene classification, to the malware images, and use the k-nearest neighbor classifier to classify them. In [5], Aziz Makandar and Anita Patrot apply the Discrete Wavelet Transformation (DWT) to the malware images to extract features. They use a Support Vector Machine (SVM) to discriminate the malware classes. In [14], Aziz Makandar and Anita Patrot obtain the global features of the malware images by using the Gabor wavelet transform and GIST. An artificial neural network (ANN) is used to train and test on the malware images.

III. OUR METHODOLOGY

In this section, we present our approach step by step. First, we describe malware visualization and reorganization. Second, we introduce the LBP and how we apply it to our images. Third, we show the TensorFlow architecture that we use to train and classify.

A. Malware Visualization

In [1], L. Nataraj, S. Karthikeyan, G. Jacob and B. S. Manjunath visualized the malware as a grey scale image in the range [0, 255]. The width of the image is fixed and the height is allowed to vary. In [14], Aziz Makandar and Anita Patrot also convert malware into grey scale in the range [0, 255]. In [5], the malware is also visualized as a grey scale image and normalized to a dimension of 256*256.

Our methodology reorganizes the grey scale malware images provided by L. Nataraj, S. Karthikeyan, G. Jacob and B. S. Manjunath [1, 4]. They convert the malware into grey scale images, which are obtained by reading the malware in binary. A malware binary is read as a vector of 8 bit unsigned integers and then arranged into a 2D array (Figure 2). We rearrange the pixels of the malware images into 3 by 3 grids (Figure 3). We convert the malware images into 3 by 3 grids because this layout is suitable for extracting the LBP descriptor.
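The visualization and reorganization steps of Section III-A can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the function names, the `width` and `blocks_per_row` parameters, and the choice to drop trailing bytes that do not fill a full row or block are all hypothetical.

```python
import numpy as np

def bytes_to_pixels(data: bytes) -> np.ndarray:
    """Read a malware binary as a vector of 8 bit unsigned integers (0..255)."""
    return np.frombuffer(data, dtype=np.uint8)

def to_image(pixels: np.ndarray, width: int) -> np.ndarray:
    """Lay the pixel vector out line by line with a fixed width.
    Assumption: trailing bytes that do not fill a row are dropped."""
    height = len(pixels) // width
    return pixels[:height * width].reshape(height, width)

def reorganize_3x3(pixels: np.ndarray, blocks_per_row: int) -> np.ndarray:
    """Place every 9 consecutive pixels into one 3 by 3 block and tile the
    blocks left to right. Assumption: an incomplete last block row is dropped."""
    n_block_rows = (len(pixels) // 9) // blocks_per_row
    used = n_block_rows * blocks_per_row * 9
    blocks = pixels[:used].reshape(n_block_rows, blocks_per_row, 3, 3)
    # Reorder axes to (block row, row inside block, block column, column inside block)
    # so that flattening yields a 2D image made of 3 by 3 tiles.
    return blocks.transpose(0, 2, 1, 3).reshape(n_block_rows * 3, blocks_per_row * 3)
```

With 18 pixels numbered 1 to 18 and `blocks_per_row=2`, `reorganize_3x3` returns the rows `[1, 2, 3, 10, 11, 12]`, `[4, 5, 6, 13, 14, 15]` and `[7, 8, 9, 16, 17, 18]`, which matches the 1st to 18th pixel layout of the reorganized image figure.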
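[Figure: malware visualization. Malware binary (0100101101101…) → binary to 8 bit vector → 8 bit vector converted to grey scale image → malware image.]

[Figure: LBP example on one 3 by 3 neighborhood with central value 27.]
Input      Threshold   Weights      Multiply
25 10 29   0 0 1       1   2   4    0   0   4
 7 27 30   0   1       128     8    0       8
56 41 13   1 1 0       64  32  16   64  32  0
LBP = 4+8+32+64 = 108 (written to the central pixel)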
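The thresholding, weighting and summing shown in the example above can be reproduced with a short NumPy sketch. The weight layout is taken from the figure; everything else is an assumption for illustration: the function names are hypothetical, a neighbor equal to the center is coded as 0 (the text only specifies the greater and smaller cases), and border pixels without 8 neighbors are skipped.

```python
import numpy as np

# Power-of-two weights in the layout of the figure; the center carries no weight.
WEIGHTS = np.array([[1,   2,  4],
                    [128, 0,  8],
                    [64,  32, 16]])

def lbp_code(patch) -> int:
    """LBP value of one 3 by 3 patch: neighbors greater than the center get 1."""
    patch = np.asarray(patch)
    bits = (patch > patch[1, 1]).astype(int)   # assumption: equal values map to 0
    return int((bits * WEIGHTS).sum())

def lbp_image(img) -> np.ndarray:
    """Apply lbp_code at every pixel that has 8 neighbors (borders skipped)."""
    img = np.asarray(img)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y - 1, x - 1] = lbp_code(img[y - 1:y + 2, x - 1:x + 2])
    return out

def lbp_histogram(img) -> np.ndarray:
    """256-bin histogram of the LBP codes, used as the image descriptor."""
    return np.bincount(lbp_image(img).ravel(), minlength=256)

sample = [[25, 10, 29],
          [7,  27, 30],
          [56, 41, 13]]
print(lbp_code(sample))  # 108, as in the figure: 4 + 8 + 32 + 64
```

The explicit double loop keeps the correspondence with the per-pixel description visible; a production implementation would more likely vectorize it or use an existing image library.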
[Figure: each pixel is read line by line ({{first pixel},{second pixel}…..}) and every 9 pixels are reorganized into a 3 by 3 grid.]
1st pixel 2nd pixel 3rd pixel | 10th pixel 11th pixel 12th pixel
4th pixel 5th pixel 6th pixel | 13th pixel 14th pixel 15th pixel
7th pixel 8th pixel 9th pixel | 16th pixel 17th pixel 18th pixel
Fig. 3: Reorganized malware image

B. Local Binary Pattern

The local binary pattern, a visual descriptor, is useful for texture analysis and texture classification [6,7,8]. As Figure 4 demonstrates, the value of the central pixel serves as the threshold. The 8 neighbors around a pixel are compared with the central pixel. If a neighbor's value is greater than that of the central pixel, the neighbor is assigned '1'; a neighbor whose value is less than the threshold is assigned '0'. The thresholded results are multiplied by weights given by powers of two, and the central value becomes the sum of these products. The same process is repeated for each pixel in the image. The final LBP descriptor is obtained by calculating the histogram of the resulting image.

C. TensorFlow Architecture

We use TensorFlow [13, 15] for training and testing. As Figure 5 shows, we use a 3x3 convolutional filter with ReLU and then apply a 2x2 max pooling layer with stride 2 to downsample. The first convolutional layer has 16 filters. The second convolutional layer also uses 3x3 filters, but 32 of them. The output of max pooling is multi-dimensional, so a flattening layer is applied to convert the multi-dimensional nodes into one-dimensional nodes. The flattened output then feeds the fully connected layer.

D. Dataset

The dataset we use is provided by [1, 4]. It includes 32 families and around 12000 grey scale malware images (Table I). The malware mainly belongs to the trojan, password stealer and virus types. We use 20% of each malware family for training and the rest for testing.

IV. EXPERIMENTAL RESULTS

We evaluate TensorFlow on the classification of LBP features, and also use the LBP features to train a Support Vector Machine (SVM) classifier and a k-nearest neighbor (KNN) classifier. In addition, we implement GIST [9, 10, 11] features with TensorFlow, KNN and SVM. Table II, Table III and Table IV are the confusion matrices of TensorFlow, KNN and SVM using the LBP feature. According to the confusion matrices, malware belonging to families 28, 29 and 30, which are Virut.A, Virut.AC and Virut.AT respectively, is easily confused. As seen in Table II, TensorFlow differentiates these three families with higher accuracy than the other classifiers. Table V displays the accuracy of the different methodologies over the 32 malware families.

V. PROS AND CONS

Our approach runs on a GPU, which significantly shortens the execution time (Figure 6). Moreover, this method does not have to run on a virtual machine or virtual environment
[Figure: network architecture. Input image → 3x3 convolutional + ReLU layer (stride 1) → 2x2 max pooling (stride 2) → 3x3 convolutional + ReLU layer (stride 1) → 2x2 max pooling (stride 2) → flattened layer → fully connected layer → output.]
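The layer sequence of Section III-C (3x3 convolution with ReLU and 16 filters, 2x2 max pooling with stride 2, 3x3 convolution with ReLU and 32 filters, 2x2 max pooling with stride 2, flattening, fully connected layer) can be traced with a small forward pass. This is a hedged sketch, not the authors' TensorFlow model: it uses plain NumPy instead of TensorFlow so the shapes are easy to follow, and the unpadded convolutions, the 32x32 input size, the random weights and all function names are assumptions. The 32 outputs follow the 32 families of the dataset.

```python
import numpy as np

def conv3x3_relu(x, kernels):
    """x: (H, W, C_in), kernels: (3, 3, C_in, C_out). Stride 1, no padding (assumed)."""
    h, w, _ = x.shape
    out = np.zeros((h - 2, w - 2, kernels.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            # Shifted window (H-2, W-2, C_in) times (C_in, C_out) for this kernel tap.
            out += x[dy:dy + h - 2, dx:dx + w - 2, :] @ kernels[dy, dx]
    return np.maximum(out, 0.0)  # ReLU

def max_pool2x2(x):
    """2x2 max pooling with stride 2; an odd trailing row or column is dropped."""
    h, w, c = x.shape
    h2, w2 = h // 2, w // 2
    return x[:h2 * 2, :w2 * 2, :].reshape(h2, 2, w2, 2, c).max(axis=(1, 3))

def init_params(input_hw, n_classes, rng):
    """Random weights sized for the two conv/pool stages and the dense layer."""
    h, w = input_hw
    for _ in range(2):                       # each stage: conv (-2), then pool (//2)
        h, w = (h - 2) // 2, (w - 2) // 2
    return {
        "conv1": rng.normal(0, 0.1, (3, 3, 1, 16)),    # 16 filters
        "conv2": rng.normal(0, 0.1, (3, 3, 16, 32)),   # 32 filters
        "dense_w": rng.normal(0, 0.1, (h * w * 32, n_classes)),
        "dense_b": np.zeros(n_classes),
    }

def forward(image, params):
    """Grey scale image (H, W) to one score per malware family."""
    x = np.asarray(image, dtype=float)[..., np.newaxis]   # add channel axis
    x = max_pool2x2(conv3x3_relu(x, params["conv1"]))
    x = max_pool2x2(conv3x3_relu(x, params["conv2"]))
    flat = x.reshape(-1)                                   # flattened layer
    return flat @ params["dense_w"] + params["dense_b"]    # fully connected layer

rng = np.random.default_rng(0)
params = init_params((32, 32), n_classes=32, rng=rng)
scores = forward(rng.integers(0, 256, (32, 32)), params)
print(scores.shape)  # (32,)
```

For the assumed 32x32 input the feature maps shrink as 30x30x16, 15x15x16, 13x13x32 and 6x6x32, so the flattened layer has 1152 nodes. Training (loss, optimizer, batching) is left out because the text does not describe it.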
Classification Method   Accuracy
TensorFlow+LBP          93.17%
SVM+LBP                 87.88%
KNN+LBP                 85.93%
TensorFlow+GIST         87.88%
SVM+GIST                81.23%
KNN+GIST                82.83%

classify the results with the TensorFlow library. The comparison over different classifiers and features demonstrates that using LBP with TensorFlow obtains higher accuracy than the other approaches. Furthermore, extending the malware dataset, converting malware to the RGBA color space, designing different TensorFlow architectures and testing more image descriptors are our future work, which may improve the research and yield a more comprehensive methodology.