First Review B19

The document describes a project that proposes using transfer learning for innovative imaged-based malware detection. The project involves visualizing malware binaries as grayscale images to extract features for malware recognition. The goal is to achieve faster training, reduce overfitting with small datasets, and increase resilience against evolving malware. The project members and supervisor are listed, along with the objectives, scope, abstract, and introduction of the project. System requirements and a literature review summarizing 5 papers on related topics are also provided.

SRM Institute of Science and Technology ,

Ramapuram Campus, Chennai-89

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


Envisioning and proposing an Innovative Imaged based
Malware Detection using Transfer Learning
Batch No:19
DETAILS OF THE PROJECT MEMBERS:
MAHINDER PM, RA1911026020088
BHAVANI SAI K B, RA1911026020092
MOHAMMED SHAFI S, RA1911026020093

SUPERVISOR DETAILS:
AMIRTHA LAKSHMI

Department of Computer Science and Engineering
5-Aug-22
OBJECTIVE
• To achieve faster preprocessing and training of samples.
• To reduce overfitting with smaller malware training datasets.
• To be resilient against sophisticated malware evolution over time and against anti-malware evasion tactics.
• To extract dynamic image features.
• To extract features from visualized malware for malware recognition.

SCOPE
View the malware detection problem as a multi-class image classification problem by visualizing binary code as a two-dimensional (2D) grayscale image. The structure of the PE binary file (cleanware or malware) is studied by converting it into an image, which provides more information about it. The binary images corresponding to the same class appear quite similar in structure and texture, whereas they are distinct between different classes. The various subsections of a PE binary are visualized with different textures. The small
modifications made to the binary
ABSTRACT
The identification and extraction of distinct features for each malware is another issue for generalizing the malware detection system. Features that contribute to the generalization capability of the classifier are difficult to engineer given the modifications in each malware. Conventional malware detection systems employ static signature-based methods and dynamic behavior-based methods, which are inefficient in analyzing and detecting advanced and zero-day malware. To address these issues, this work employs a visualization approach where malware is represented as 2D images and proposes a robust machine learning-based anti-malware solution.

INTRODUCTION
The internet has become a key aspect of our daily lives. Although it makes our lives convenient, the internet has made innocent users vulnerable to attacks. The rise of the internet and the emergence of social networks have triggered exponential growth in malware. The increasing number and complexity of malware have become one of the most serious cybersecurity threats. The key idea of transfer learning is that the knowledge gained in learning one task can help to improve learning on a different task.

SYSTEM REQUIREMENTS
Hardware Requirements
• Processor: Minimum i3 Dual Core
• Network: Ethernet connection (LAN) or a wireless adapter (Wi-Fi)
• Hard Drive: Minimum 100 GB; Recommended 200 GB or more
• Memory (RAM): Minimum 8 GB; Recommended 32 GB or above
Software Requirements
• Python, Anaconda, Jupyter Notebook, TensorFlow, Keras

Literature survey - Paper 1
• Title: Intelligent Vision-Based Malware Detection and Classification Using Deep Random Forest Paradigm
• Authors: S. Abijah Roseline, S. Geetha, Seifedine Kadry and Yunyoung Nam
• Year: 2020
• Inference: Adaptation of a vision-based malware analysis technique, wherein the malware executable files are converted to grayscale images to exploit global features. It avoids both the risk of executing the malware for analysis, as in dynamic analysis, and the need for intense knowledge of opcodes/assembly language to understand the malware code, as in static analysis and reverse engineering. It involves a clean, straightforward, simple, and powerful visual feature engineering and extraction approach.
• Advantages: Simple, fast and less complex. Can improve the worst-case performance. Reduces resource consumption while meeting reliability demands.
• Disadvantages: Unsuitable for large-scale scenarios. It is not an easy-to-use method. Poor application performance.

Literature survey - Paper 2
• Title: Robust Intelligent Malware Detection Using Deep Learning
• Authors: R. Vinayakumar, Mamoun Alazab, K. P. Soman, Prabaharan Poornachandran and Sitalakshmi Venkatraman
• Year: 2019
• Inference: Current malware detection solutions that adopt static and dynamic analysis of malware signatures and behavior patterns are time consuming and have proven to be ineffective in identifying unknown malware in real time. Recent malware uses polymorphic, metamorphic, and other evasive techniques to change its behavior quickly and to generate a large number of new malware samples. Such new malware is predominantly a variant of existing malware, and machine learning algorithms (MLAs) have recently been employed to conduct effective malware analysis.
• Advantages: Found attractive outcomes. Performs better in various circumstances and environments. Achieves a well-balanced tradeoff among various parameters.
• Disadvantages: Cannot be changed after configuration. Has not been investigated thoroughly. High complexity of installing and maintaining.

Literature survey - Paper 3
• Title: MalJPEG: Machine Learning Based Solution for the Detection of Malicious JPEG Images
• Authors: Aviad Cohen, Nir Nissim and Yuval Elovici
• Year: 2020
• Inference: JPEG is the most popular image format, primarily due to its lossy compression. It is used by almost everyone, from individuals to large organizations, and can be found on almost every device (on digital cameras and smartphones, websites, social media, etc.). Because of their harmless reputation, massive use, and high potential for misuse, JPEG images are used by cyber criminals as an attack vector. While machine learning methods have been shown to be effective at detecting known and unknown malware in various domains
• Advantages: It provides easy information processing and cost reduction as well. Reduces the resources used for processing. Boosts the performance.
• Disadvantages: Unsuitable for large-scale scenarios. Heavyweight. Prone to errors.

Literature survey - Paper 4
• Title: A Malware Detection Method of Code Texture Visualization Based on an Improved Faster RCNN Combining Transfer Learning
• Authors: Yuntao Zhao, Wenjie Cui, Shengnan Geng, Bo Bo, Yongxin Feng and Wenbo Zhang
• Year: 2020
• Inference: A malware detection method of code texture visualization based on an improved Faster RCNN (Region-Convolutional Neural Networks) combining transfer learning is proposed. We utilize visualization technology to map malicious code into corresponding images with typical texture features, and realize the classification of malware.
• Advantages: Quick calculation time. Improves the quality and consistency of data. Reduces the resources used for processing.
• Disadvantages: Heavyweight. Difficult and less commonly used. Prone to errors.

Literature survey - Paper 5
• Title: Dynamic Analysis for IoT Malware Detection With Convolution Neural Network Model
• Authors: Jueun Jeon, Jong Hyuk Park and Young-Sik Jeong
• Year: 2020
• Inference: This paper proposes a dynamic analysis for IoT malware detection (DAIMD) to reduce damage to IoT devices by detecting both well-known IoT malware and new and variant IoT malware that has evolved intelligently. The DAIMD scheme learns IoT malware using the convolution neural network (CNN) model and analyzes IoT malware dynamically in a nested cloud environment.
• Advantages: Continuous security and robustness against attacks. Relatively simple and computationally inexpensive method. May not meet the real-time requirement.
• Disadvantages: Tedious message updating. Difficult to use in large-scale parallel computing. Maximizes the complexity of the problem.

Literature survey - Paper 6
• Title: Multi-Loss Siamese Neural Network With Batch Normalization Layer for Malware Detection
• Authors: Jinting Zhu, Julian Jang-Jaccard and Paul A. Watters
• Year: 2020
• Inference: We propose a new one-shot model called Multi-Loss Siamese Neural Network with Batch Normalization Layer that can work with fewer samples while providing high detection accuracy. Our model utilizes the Siamese Neural Network to detect new variants of malware and is trained with only a few samples. Our model is equipped with batch normalization and multiple loss functions to address the overfitting issue caused by the use of small samples, which can create the vanishing gradient problem as a result of binary cross-entropy loss, and with a feature embedding space to improve the detection accuracy.
• Advantages: Minimizes the workload on infrastructures. Flexible with the architectures that can be used. Trustworthy and reliable, which refers to obtaining explainability.
• Disadvantages: Solutions have been proved ineffective. Complexity of its real-time implementation. This system is opportunistic and uncontrollable.

Literature survey - Paper 7
• Title: The Image Game: Exploit Kit Detection Based on Recursive Convolutional Neural Networks
• Authors: Suyeon Yoo, Sungjin Kim and Brent Byunghoon Kang
• Year: 2020
• Inference: We propose a multiclass ConvNet model to classify exploit kits, where we adopt various image processing techniques and adjust the size and other parameters of images. The proposed ConvNet model recursively updates images and is designed for fully preserving image properties. This model updates the output of feature maps and pooling using an original image. This model was tested using 36,863 real-world datasets, achieving a 98.2% accuracy in exploit kit detection and family classification.
• Advantages: May not meet the real-time requirement. Simplifies the implementation process. Increased efficiency and speed.
• Disadvantages: Significantly increases capital and operating expenditures. Prone to errors. Solutions have been proved ineffective.

Literature survey - Paper 8
• Title: Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems
• Authors: Seoungyul Euh, Hyunjong Lee, Donghoon Kim and Doosung Hwang
• Year: 2020
• Inference: This paper proposes low-dimensional but effective features for a malware detection system and analyzes them with tree-based ensemble models. Expert knowledge and frequency analysis are adapted for relevant feature selection from the collected data set, which contributes to fast low-dimensional feature preparation, low storage usage, and fast learning. We extract the five types of malware features represented from binary or disassembly files.
• Advantages: Lowers the complexity threshold. Can improve the worst-case performance. Quick calculation time.
• Disadvantages: Has not been investigated thoroughly. Cannot meet current network business demands. Complexity of its real-time implementation.

Literature survey - Paper 9
• Title: Adversarial Machine Learning Applied to Intrusion and Malware Scenarios: A Systematic Review
• Authors: Nuno Martins, José Magalhães Cruz, Tiago Cruz and Pedro Henriques Abreu
• Year: 2020
• Inference: The aim of this survey is to explore works that apply adversarial machine learning concepts to intrusion and malware detection scenarios. We concluded that a wide variety of attacks were tested and proven effective in malware and intrusion detection, although their practicality was not tested in intrusion scenarios. Adversarial defenses were substantially less explored, although their effectiveness was also proven at resisting adversarial attacks.
• Advantages: Can improve the worst-case performance. Excellent empirical performance. Fast and efficient, but also as accurate as the state-of-the-art algorithms.
• Disadvantages: Additional configuration is required. Resulting data errors. This system is opportunistic and uncontrollable.

Architecture Diagram

ISSUES
• Inefficient, since malware attackers execute malicious activities and constantly create zero-day malware.
• Works well with fewer data and requires high computation overhead.
• Did not identify packed malware, since the entropy measure was high and patterns were not visualized.
• Computation burden may limit its further application in real scenarios.
• Difficult to obtain better performance.
• Cannot meet current network business demands.
• Generally have high polynomial running times.
Algorithm Used
Transfer Learning Algorithm
Advantages
● Improved baseline performance.
● Tackles problems such as having little or almost no labeled data available.
● State-of-the-art performance.

Proposed Methodology
The first convolutional layer takes as input a square grayscale image with one channel and outputs data with channels using a kernel size of three, padding of two, and a stride of one. ReLU activation and max pooling are applied to the result before passing it to the second convolutional layer. This second layer outputs data with channels, with the other parameters being the same as in the first convolutional layer. Again, ReLU activation and max pooling are applied before passing data to the first fully connected layer. This first fully connected layer outputs a vector of dimension . After applying ReLU activation, the data is passed to the second fully connected layer, which reduces the output to a dimensional vector. Finally, ReLU activation is again applied and the data passes to the last fully connected layer, which is used to classify the sample.
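The spatial sizes implied by the layer parameters above can be worked out with a short sketch. The kernel size (three), padding (two), and stride (one) come from the description; the 2x2 pooling window and the 64-pixel input size are assumptions made only for illustration, and channel counts are left out because they are not stated here.

```python
def conv_output_size(size, kernel=3, padding=2, stride=1):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2):
    """Spatial size after non-overlapping max pooling (window size is an assumption)."""
    return size // window


def trace_shapes(input_size):
    """Follow a square image through conv -> pool -> conv -> pool."""
    sizes = [input_size]
    for _ in range(2):
        sizes.append(conv_output_size(sizes[-1]))
        sizes.append(pool_output_size(sizes[-1]))
    return sizes


# Hypothetical 64x64 input image.
print(trace_shapes(64))  # [64, 66, 33, 35, 17]
```

With a padding of two and a kernel of three, each convolution grows the image by two pixels per side length before pooling halves it, which is why the sizes go 64, 66, 33, 35, 17.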
Module 1 : Image Preprocessing
The performance of deep learning models is sensitive to the quality and quantity of data being passed to the model. Raw data as input rarely yields the best achievable performance of the model, due to possible pre-existing noise and inconsistency in the images. Therefore, a definite flow of preprocessing is essential to train the model better. The PE binary files (malware or cleanware) are given as input to the proposed model. Each file comprises the hexadecimal representation of its binary content. Each binary is converted using a function that processes hexadecimal into images in .png format. First, each line of a binary is scanned and every set of eight characters is stored in an array. Then each byte is converted to its decimal equivalent and stored in another array. This conversion process is repeated for all the lines of the binary file. The array with decimal values is converted to a binary image using the Python Imaging Library (PIL) package.
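The conversion steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual function: it assumes each line is a whitespace-separated hex dump in which bytes appear as two-character tokens (any other token, such as an address column or a "??" placeholder, is skipped), it uses a hypothetical fixed image width, and it writes a PGM grayscale image with the standard library instead of a PNG through PIL.

```python
import string


def hex_lines_to_pixels(lines, width=16):
    """Convert hex-dump lines into rows of grayscale values (0-255)."""
    values = []
    for line in lines:
        for token in line.split():
            # Keep only two-character hexadecimal tokens (assumed byte format).
            if len(token) == 2 and all(c in string.hexdigits for c in token):
                values.append(int(token, 16))
    # Drop the incomplete final row so that every row has the same width.
    usable = len(values) - len(values) % width
    return [values[i:i + width] for i in range(0, usable, width)]


def to_pgm(rows):
    """Serialize pixel rows as a binary PGM (P5) grayscale image."""
    height, width = len(rows), len(rows[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(v for row in rows for v in row)


# Hypothetical two-line sample with an address column and a placeholder token.
sample = ["00401000 4D 5A 90 00", "00401004 FF ?? 10 7F"]
rows = hex_lines_to_pixels(sample, width=4)
print(rows)  # [[77, 90, 144, 0]]
```

If Pillow is installed, the same flattened bytes could instead be handed to `Image.frombytes("L", (width, height), data)` and saved as a .png file, which is closer to the flow described above.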

Module 2 : Feature Extraction
A neural network is recognized as a feature learner that typically consists of two parts and has a tremendous capacity to automatically extract essential features from input data. The first part is the feature extractor, which involves convolutional and pooling layers and automatically learns the characteristics from raw data. The fully connected layer then performs the classification, relying on the attributes learned in the first part. The input layer is composed of the individual values that denote the smallest unit of input, whereas the output layer comprises as many outputs as there are categories in the particular classification problem. The convolutional layer performs a convolution over small localized regions of the preceding layer. This is used specifically for extracting features from the raw data. Pooling layers are employed after convolutional layers to reduce the number of parameters and the computational complexity.
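As a small illustration of the two feature-extraction operations described above, the sketch below applies a convolution over local regions and then max pooling, in plain Python. The 5x5 image, the 2x2 edge kernel, and the 2x2 pooling window are hypothetical values chosen for the example; a real model would learn its kernels during training.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation with stride one."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, window=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // window
    cols = len(feature_map[0]) // window
    return [
        [
            max(feature_map[r * window + i][c * window + j]
                for i in range(window) for j in range(window))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical grayscale patch with a bright vertical stripe.
image = [[0, 0, 255, 255, 0] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright vertical edges

features = conv2d(image, edge_kernel)
pooled = max_pool(features)
print(features[0])  # [0, 510, 0, -510]
print(pooled)       # [[510, 0], [510, 0]]
```

The 4x4 feature map shrinks to 2x2 after pooling while the strong edge response survives, which is the reduction in resolution and computation that pooling is meant to provide.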

Module 3 : Create Transfer Learning Model and Predict Malware
The training data was split into training, validation, and testing sections. A few epochs were selected for each iteration. A total of % of the data was chosen for training, while the remaining % was retained for testing and validation. The learning rate for the last layers' parameters is e- and the learning rate for the other layers' parameters is set to e-. The cross-entropy loss and Adam optimizer are used with batch size
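The split into training, validation, and testing sections might look like the sketch below. The split percentages are not given above, so `train_fraction=0.8` is purely a hypothetical value, as is the choice to divide the held-out samples evenly between validation and testing.

```python
import random


def split_dataset(samples, train_fraction, seed=0):
    """Shuffle and split samples into train / validation / test lists.

    The held-out part is divided evenly between validation and test,
    which is an assumption made for this illustration.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    held_out = shuffled[n_train:]
    n_val = len(held_out) // 2
    return shuffled[:n_train], held_out[:n_val], held_out[n_val:]


# Hypothetical dataset of 100 sample indices and a hypothetical 80% training share.
train, val, test = split_dataset(range(100), train_fraction=0.8)
print(len(train), len(val), len(test))  # 80 10 10
```

Fixing the seed keeps the split reproducible between runs, so results from different training iterations are compared on the same held-out samples.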
Pooling Layer: Once the feature maps are detected, the topology of the data has already been preserved in the feature maps. Therefore, location information becomes less important. A pooling operation is applied to reduce the resolution of the feature maps and achieve spatial invariance. Each feature map in the pooling layer corresponds to a specific feature map in the previous layer. Thus, the number of feature maps in the current layer is the same as in the previous layer.
Fully Connected Layer: A fully connected layer usually comes after several inner-product layers, pooling layers, and a merging layer. Each node (neuron) in a fully connected layer connects to all nodes in the previous layer. The output of a node is computed by multiplying the nodes in the previous layer by their weights, summing the results, and passing the sum through an activation function.
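The node computation described above can be written out in a few lines. This is a minimal sketch with hypothetical inputs and weights; ReLU is used as the activation because it is the one named in the proposed methodology, and bias terms are omitted since they are not mentioned here.

```python
def relu(x):
    """Rectified linear unit: negative values become zero."""
    return max(0.0, x)


def dense_layer(inputs, weights, activation=relu):
    """Each output node sums weight * input over all previous-layer nodes,
    then applies the activation function."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)))
        for node_weights in weights
    ]


# Hypothetical previous-layer values and one weight row per output node.
outputs = dense_layer(
    inputs=[1.0, 2.0, 3.0],
    weights=[[0.5, 1.0, 0.5], [1.0, 1.0, -2.0]],
)
print(outputs)  # [4.0, 0.0]
```

The second node's weighted sum is negative (1 + 2 - 6 = -3), so ReLU clamps it to zero.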
Dropout Layer: Dropout is a regularization technique that zeros out the activation values of randomly chosen neurons during training. This constraint forces the network to learn more robust features rather than relying on the predictive capability of a small subset of neurons in the network. Tompson et al. extended this idea to convolutional networks with Spatial Dropout, which drops out entire feature maps rather than individual neurons.
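A rough sketch of per-neuron dropout is shown below. The rescaling of surviving activations by 1 / (1 - rate), often called inverted dropout, is an assumption added here so that the expected activation stays the same between training and inference; it is not stated in the description above. The rate and seed are hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training.

    Survivors are scaled by 1 / (1 - rate) (inverted dropout, an assumption),
    and the layer is a no-op at inference time.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(42)  # hypothetical seed for reproducibility
dropped = dropout([1.0] * 10, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

Spatial Dropout would apply the same keep-or-drop decision once per feature map rather than once per value.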
Batch Normalization Layer: Batch Normalization is another regularization technique that normalizes the set of activations in a layer. Normalization works by subtracting the batch mean from each activation and dividing by the batch standard deviation. This normalization technique, along with standardization, is a standard technique in the preprocessing of pixel values.
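The normalization step described above amounts to the following arithmetic. This sketch covers only the subtract-mean, divide-by-standard-deviation part for one feature across a batch; the small `eps` term is an assumption added to avoid division by zero, and the learnable scale and shift used in full batch normalization layers are left out.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    std = math.sqrt(variance + eps)  # eps is an assumed stabilizer
    return [(x - mean) / std for x in batch]


# Hypothetical activations of one neuron across a batch of four samples.
normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
print(normalized)
```

After normalization the values have a mean of approximately zero and a variance of approximately one, regardless of the original scale of the activations.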

