MALWARE DETECTION USING MACHINE LEARNING

Project report submitted in partial fulfillment of the requirement for the degree
of Bachelor of Technology

in

COMPUTER SCIENCE AND ENGINEERING


By

Dewesha Sharma (171361)

UNDER THE SUPERVISION OF

Dr. Himanshu Jindal

to

Department of Computer Science Engineering and Information Technology


Jaypee University of Information Technology, Waknaghat, Solan -173234,
Himachal Pradesh
Candidate’s Declaration

I hereby declare that the work presented in this report entitled “Malware Detection using
Machine Learning” in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science and Engineering / Information
Technology submitted in the department of Computer Science & Engineering and
Information Technology, Jaypee University of Information Technology, Waknaghat, is an
authentic record of my own work carried out over a period from August 2020 to December
2020 under the supervision of Dr. Himanshu Jindal, Assistant Professor (Senior Grade),
Department of Computer Science & Engineering and Information Technology.

The matter embodied in the report has not been submitted for the award of any other degree
or diploma.

Dewesha Sharma (171361)

This is to certify that the above statement made by the candidate is true to the best of my
knowledge.

Dr. Himanshu Jindal


Assistant Professor (Senior Grade)
Department of Computer Science & Engineering and Information Technology
Dated : 16 May 2021

ACKNOWLEDGEMENT

I wish to express my sincere gratitude to my guide, Dr. Himanshu Jindal, Department of
Computer Science & Engineering, Jaypee University of Information Technology,
Waknaghat, for giving me this wonderful opportunity to work with him. I am
grateful for his constant encouragement, motivation, cooperation and support, which helped
me finish this project successfully. Without his expert guidance this would not have
been possible. I am also grateful to the people whose works I have referred to; the
details are mentioned in the references. I would also like to thank all
my friends and the lab assistants for extending their help and support whenever it was
needed.

TABLE OF CONTENTS

CERTIFICATE i
ACKNOWLEDGEMENT ii
TABLE OF CONTENTS iii
LIST OF ACRONYMS AND ABBREVIATIONS v
LIST OF FIGURES vi
LIST OF GRAPHS viii
LIST OF TABLES ix
ABSTRACT x

CHAPTER-1: INTRODUCTION 1

1.1 Introduction 1

1.2 Problem Statement 3

1.3 Objective 3

1.4 Malware 3

1.4.1 Types of Malware 4

1.4.2 Types of Malware Analysis 8

1.4.3 Types of Malware Detection 9

1.5 Machine Learning 10

1.5.1 Types of Machine Learning 12

1.5.2 Machine Learning algorithms 13

1.5.3 Performance Measures 15

1.6 Organization 17

CHAPTER-2: LITERATURE SURVEY 19

2.1 Related Work 19

CHAPTER-3: SYSTEM DEVELOPMENT 21

3.1 Python 21

3.2 Google Colab 22

3.3 Pandas 22

3.4 NumPy 23

3.5 Scikit - learn 24

3.6 Matplotlib 24

3.7 Dataset 24

3.8 Implementation 25

3.8.1 Preparing the dataset 25

3.8.2 Learning Algorithms 31

CHAPTER-4: PERFORMANCE ANALYSIS 34

4.1 Random Forest 34

4.2 KNN 35

4.3 Gradient Boosting 36

CHAPTER-5: CONCLUSIONS 38

5.1 Conclusions 38

5.2 Future Scope 39

5.3 Applications / Contributions 40

REFERENCES 41

LIST OF ACRONYMS AND ABBREVIATIONS

ML - Machine Learning

AI - Artificial Intelligence

DDoS - Distributed Denial of Service

Fig. - Figure

Colab - Colaboratory

IoT - Internet of Things

OS - Operating System

KNN - K-Nearest Neighbors

LIST OF FIGURES

Figure Page No

1 1

2 2

3 2

4 7

5 11

6 14

7 15

8 17

9 21

10 22

11 22

12 23

13 23

14 24

15 24

16 25

17 26

18 26

19 27

20 27

21 28

22 29

23 29

24 30

25 31

26 31

27 33

28 33

29 34

30 34

31 35

32 35

33 36

34 36

35 36

36 37

37 37

LIST OF GRAPHS

Graph Page No

1 30

2 32

3 38

4 39

LIST OF TABLES

Table Page No

1 16

2 20

ABSTRACT

With technology advancing at a fast pace, the digital world faces alarming security
threats and challenges in the form of malware capable of bringing down organizations and
governments. Countermeasures have become stronger, with antivirus companies regularly
updating and enlarging their signature databases, but signature matching is not efficient
enough and fails in the case of polymorphic malware. In this project we present an
alternative approach: detecting malicious files using machine learning algorithms such as
K-NN, Random Forest and XGBoost, and comparing their results to determine the most
suitable algorithm for our dataset.
CHAPTER 1

INTRODUCTION

1.1 Introduction

It has become impossible to imagine a world without the internet, and with the rate at
which technology is growing, the idea of the cyborg seems more and more real. Technology is
closely integrated with our lives in the form of mobile phones, laptops, smart watches, money
transactions etc. An extensive amount of work is being done in the field of IoT, and a lot of
research, time and effort is being devoted to building smart houses, smart cities etc. At the
centre of it all is data, and technology isn't completely foolproof: it has its own flaws and
vulnerabilities. For the scope of this project we focus only on laptops and desktops.
Of all the operating systems available, Windows still holds the majority share. According to
NetMarketShare, 76.32% of users use Windows.

Fig. 1

The vulnerabilities present in the OS are exploited to gain access to the personal data of
organizations and individuals, leaving them compromised. The stolen data often ends up for
sale on the dark web, or the victims are extorted in exchange for safely recovering it. One
such recent example is WannaCry, which spread through the EternalBlue exploit and ended up
affecting computers in around 150 countries.

Fig. 2

The number of cyberattacks increased even further during the pandemic. The following figure
gives a clearer picture of this.

Fig. 3

Current static and dynamic analysis methods fail to detect malware efficiently, particularly
in the case of zero-day attacks or polymorphic malware. This led us to work towards utilizing
machine learning to develop a better and more efficient model that is able to
identify a malicious file and help secure our systems.

1.2 Problem Statement

Considering the rise in cyber attacks and the dynamic nature of technology and malware, a
working model capable of detecting malicious files based on certain features needs to be
developed. The scope of the model is limited to Windows executable files.

1.3 Objective

The following objective will be met upon successful completion of the project -

• A working model capable of detecting malicious files

1.4 Malware

‘Malicious Software’, more commonly known as ‘Malware’, is a collective name for
various kinds of malicious software such as viruses, ransomware, spyware, worms, Trojan
horses etc. Malware consists of code which, upon execution, damages or manipulates the
data and the system according to its design. These programs can steal,
encrypt or delete data, modify or hijack basic computer functions and monitor computer
activity without the user’s permission.

Earlier malware was primitive and machines were infected through floppy disks, but
with the advancement of technology malware has also advanced, polymorphic malware
being one such example. Polymorphic malware continuously changes its signature and is able
to evade signature based detection easily.

1.4.1 Types of Malware

Malware comes in many different types, each with its own specific purpose and
functionality. Here we discuss some of them -

• Virus

Viruses are the simplest malware and require human interaction to replicate. They
are capable of damaging the hardware as well as the operating system, and can
perform the following functions -

▪ Changing a file’s size and/or deleting files
▪ Formatting the hard disk
▪ Running background operations that eventually slow the system down

• Worms

Computer worms are similar to viruses; the only difference is that they are able to
execute on their own without any human interaction. They spread via
networks to other computers and replicate themselves on those machines. Worms are
of many different types, such as -

▪ E-mail worms
▪ File worms shared in the network
▪ Internet worms

• Rootkit

A rootkit enables attackers to perform privilege escalation and gain administrative (root)
access, hence the name. Rootkits are extremely difficult to detect and trace because -

▪ They can modify the active processes to evade detection by not
displaying the rootkit processes.
▪ They are also able to hide themselves from antivirus software by changing their
signatures continuously.

• Ransomware

Ransomware prevents the user from accessing the computer by encrypting all the data
present on it. The data is decrypted only after a ransom is paid to the attacker, usually in
some cryptocurrency to avoid detection by law enforcement agencies. Ransomware is
able to -

▪ Format the whole system in case the ransom is not paid
▪ Stop any anti-virus, anti-spyware or other security measure that hinders
its operation

• Spyware

As the name suggests, this malware is used to perform espionage, i.e. to spy on the digital
activities of the victim. It monitors and gathers information such as personal records,
websites visited, the actions taken on those websites, credentials etc. and sends it
via a hidden channel back to the attacker, giving them leverage over the victim.
Sometimes spyware is also used by organizations to target potential
consumers for promotional purposes, leading to an increased amount of spam. It is
capable of the following -

▪ Showing unwanted ads and redirecting the victim to third party
websites.
▪ Causing undesirable changes in the system, resulting in reduced security.
▪ Acting as a gateway to a backdoor, giving hackers remote access to the
system.

• Adware

Adware can be considered a subclass of spyware whose only purpose is to
display ads and increase traffic to a particular third party site. It also uses up
RAM and other resources, slowing down the system.

• Trojan

This malware gets its name from the wooden horse used by the Greeks in the Trojan War:
just like that wooden horse, a seemingly harmless piece of software or file contains
hidden malicious code, and once executed it enables the attacker to gain unauthorized
access to the victim's computer.

• Backdoor

As the name suggests, a backdoor provides the attacker with a secret path for
gaining unauthorized access to the victim's computer. By itself it does no direct
damage, but it provides a medium for various exploits and can also turn the machine into a
zombie bot used for DDoS attacks.

• Keylogger

A keylogger logs all the keystrokes made by the user, so that confidential data is
compromised, which can lead to blackmail or extortion. One of the
advisable methods of protecting ourselves from this is to use an on-screen keyboard
while entering our credentials.

• Remote Access Tool

It helps the attacker escalate privileges and provides the attacker with remote access to
the system. The following can be performed -

▪ Gaining access to your private data
▪ Spying on you using your webcam and even picking up your conversations via the
microphone
▪ Blocking your keyboard, shutting down the system or making it unusable
▪ Installing other malware or using the system in DDoS attacks

Fig. 4

1.4.2 Types of Malware Analysis

Before we get into malware detection, it is necessary to understand the behavior of
malware. Analysis helps us observe the behavior and functionality of the
malware. There are different ways to do this, each with its own benefits,
but the time and knowledge required for each vary greatly. They are as follows -

• Static Analysis

Static analysis, also referred to as code analysis, is done by examining the code
of the malware without executing it, in order to determine its potential behaviour and
properties. This reverse engineering can be achieved in one of the following ways -

▪ File Format Inspection

File metadata provides a lot of useful information,
such as file type, date of creation, compile time, functions that have
been imported and exported etc.

▪ String Extraction

In string extraction, the readable strings embedded in the file (error messages,
status messages etc.) are examined, and the working of the malware
is inferred from them.

▪ AV Scanning

The most common way is to scan a suspected file with anti-virus
engines; if it is a well known sample, most of them will be able to
identify it.

▪ Disassembly

Another well known and highly reliable method is to load the code into a
disassembler, which gives us a detailed view of the logic and the
working of the program. IDA Pro and Ghidra are some of the most
widely used disassemblers.

• Dynamic Analysis

Dynamic analysis, also known as behavioural analysis, refers to the process of
examining the behaviour of malware in real time. The file is executed in an isolated
environment such as a sandbox or a virtual machine, all its activities are monitored, and a
detailed analysis is made of all the changes performed by the file. This type of
analysis is much faster than static analysis.

• Hybrid Analysis

As the name suggests, both static and dynamic analysis are performed in order to gain a
better understanding of the malware. Initially the specific signatures are analyzed, and
then dynamic analysis is done to get an overall understanding of the true nature
and behaviour of the malware.
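
To make the string extraction step of static analysis more concrete, the following is a minimal sketch in Python. It is not part of the project code; the function name `extract_strings`, the minimum length of 4 and the sample bytes are all assumptions chosen for illustration. It scans raw bytes for runs of printable ASCII characters, similar in spirit to the Unix `strings` utility.

```python
import re


def extract_strings(data: bytes, min_length: int = 4) -> list[str]:
    """Return runs of printable ASCII characters at least min_length long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Hypothetical byte blob standing in for the contents of a suspicious file.
sample = (
    b"\x00\x01MZ\x90\x00http://example.test/payload\x00"
    b"\xff\xfeCreateRemoteThread\x00ab\x00"
)

strings_found = extract_strings(sample)
print(strings_found)
```

Short runs such as "MZ" and "ab" are dropped by the length filter, while the URL and the API name survive; strings of this kind are what an analyst would inspect for hints about behaviour.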

1.4.3 Types of Malware Detection

Malware detection methods are used to detect malware and prevent it from
compromising a system. They are categorized as follows -

• Signature Based Detection

This static method uses predefined signatures to detect
malware. Cryptographic hashes such as SHA1 or MD5, file metadata and static
strings are some examples of signatures. Anti-virus products work on this model. The file is
first analyzed statically by the AV. In case of a match with a
known malicious signature, the file is immediately flagged as infected. This
method is useful for well known malware but fails in the case of polymorphic
malware, which continuously changes its signature.

• Behaviour Detection

Also known as heuristics based analysis, it involves observing the behaviour of a file
during execution and flagging it as malicious if the behaviour is found suspicious. Modifying
host files or registry keys may seem harmless in isolation, but a combination of such
activities is a point of concern, and any file exceeding a certain threshold raises an alert.
The best way to implement this is in a virtual environment. Although it takes more
time, it is a safer and more robust option than signature based detection.

• Feature Detection

It can be seen as a derivative of behaviour based detection and is able to overcome
the false alarms associated with it. This method is similar to
anomaly detection, but differs in that it relies on characteristics which have
been developed manually instead of being learned by ML algorithms.
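
The weakness of signature based detection described above can be illustrated with a small sketch. This is an illustration only, not how any particular anti-virus product works: the "signature database" is a plain Python set, the payload bytes are made up, and SHA-256 is used in place of the SHA1/MD5 hashes mentioned above simply because it is always available in `hashlib`.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex digest used here as the file's signature."""
    return hashlib.sha256(data).hexdigest()


def is_known_malware(data: bytes, signatures: set[str]) -> bool:
    """Flag the file only if its hash matches a stored signature exactly."""
    return sha256_of(data) in signatures


# Hypothetical sample and a one-entry signature database built from it.
original = b"pretend this is a malicious payload"
signature_db = {sha256_of(original)}

# A polymorphic variant: same behaviour in principle, one byte appended.
mutated = original + b"\x90"

print(is_known_malware(original, signature_db))
print(is_known_malware(mutated, signature_db))
```

The unmodified sample matches its stored hash, but changing a single byte produces a completely different hash, so the variant goes undetected. This is the gap that the feature based machine learning approach in this report tries to close.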

1.5 Machine Learning

The central idea of machine learning is to build a model that is capable of receiving input
data and using statistical analysis to predict the correct outcome. The model is capable of
learning from data on its own. Machine learning can also be seen as a subfield of AI. The
basic idea is to train a model to produce some output using a certain algorithm. A training
dataset is provided and the model built on that dataset is used to make predictions.

Fig. 5

The process consists of 5 stages -

• Data Intake

The dataset is loaded from a file and stored in memory.

• Data Transformation

The loaded dataset is made suitable for the algorithm via transformation and normalization.
Data is converted so that it lies in the same range and the same format. Feature selection
and extraction are performed and the data is separated into a training set and a test set.
The training set is used to build our model and the test set is later used for evaluation.

• Model Training

Our proposed model is built from the training set using the selected algorithm.

• Model Testing

The built model is evaluated using our test set, and the results are used to build a new
model which improves on the previous ones.

• Final Model

The best model is selected once the required results are achieved or after a
certain number of iterations.
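
As a small illustration of the normalization mentioned under Data Transformation, the sketch below rescales one numeric column to the range 0 to 1 (min-max scaling). The report does not state which normalization, if any, was applied in the project, so this is an assumed example; the function name and the sample values are hypothetical.

```python
def min_max_scale(column: list[float]) -> list[float]:
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # A constant column carries no information; map everything to 0.
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]


# Hypothetical raw feature values, e.g. a size field measured in bytes.
sizes = [512.0, 4096.0, 1024.0, 2048.0]
scaled = min_max_scale(sizes)
print(scaled)
```

Scaling matters most for distance based algorithms such as KNN, where a feature with a large numeric range would otherwise dominate the distance calculation.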

1.5.1 Types of Machine Learning

There are four approaches to machine learning, as follows -

• Supervised Learning

In this type of learning, each record in the dataset is mapped to a desired outcome (label)
and our model is trained on this labelled dataset. The target variable is
predetermined and predictions are made for that variable only.

• Unsupervised Learning

In unsupervised learning, no predetermined target variable is set for our model to train on.
The model works by finding patterns and groupings in the
training data, which can then be used for interpreting new data.

• Semi-Supervised Learning

Semi-supervised machine learning algorithms combine the advantages of both supervised
and unsupervised learning by training on a mix of labelled and unlabelled data, and are
able to produce powerful classifiers.

• Reinforcement Learning

Reinforcement learning is reward based learning in which the model interacts directly
with the environment and learns from its errors.

1.5.2 Machine Learning Algorithms

The following machine learning algorithms have been used in our project -

• K-nearest neighbor

This is one of the simplest learning algorithms and is often quite accurate. KNN doesn't make
any assumptions about the structure of the data, which is useful because in real life
scenarios theoretical assumptions are rarely obeyed. The model is essentially the stored
training dataset, and no explicit training phase is required. KNN is implemented in the
following way -

▪ The value of k is chosen in relation to the number of training examples
▪ For each example in the test set, the similarity between it and all the training
examples is calculated
▪ The k training examples most similar to the current test example are selected and the
class of the test example is determined by majority vote
▪ The predicted class is checked against the information contained in the test set
▪ If there are more test samples, the similarity and voting steps are repeated for each of
them
▪ The quality of the classification is calculated using accuracy, true negative rate etc.
for the current value of k

Fig. 6
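
The KNN procedure described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used in the project; it assumes Euclidean distance as the similarity measure, and the function name and the toy data points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training example, paired with its label.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D feature vectors: label 1 = legitimate, 0 = malicious.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.1, 5.0)))
```

Note that there is no training step: every prediction scans the whole training set, which is why KNN becomes slow on large datasets.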

• Random Forest

It is one of the most popular algorithms as it gives accurate results with little
data preparation or tuning. A random forest is a collection of decision trees
whose combined prediction is more accurate than that of a single tree. The algorithm can be
described as follows -

▪ About two thirds of the data is chosen randomly for each tree and the tree is built on it
▪ At each node, predictor variables are selected randomly and the best split among them
is used to split the node
▪ The number of variables selected is constant for all trees and equals the square
root of the number of predictors
▪ The rest of the data is used to calculate the misclassification rate, which is reported
as the overall out-of-bag error rate
▪ Each trained tree votes for its classification result and the most voted
class is selected as the result
Fig. 7
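
The sampling ideas behind the random forest steps above can be sketched without building actual trees. The snippet below is an illustration under assumed sizes (1000 rows, 16 predictors) with a hypothetical list of tree votes; it shows the bootstrap sample, the out-of-bag rows, the square-root rule for candidate predictors and the final majority vote.

```python
import math
import random
from collections import Counter

rng = random.Random(0)
n_samples, n_features = 1000, 16  # hypothetical dataset dimensions

# Bootstrap: draw n_samples row indices with replacement for one tree.
bootstrap = [rng.randrange(n_samples) for _ in range(n_samples)]
in_bag = set(bootstrap)
out_of_bag = set(range(n_samples)) - in_bag  # rows this tree never sees
in_bag_fraction = len(in_bag) / n_samples

# At each split only sqrt(n_features) randomly chosen predictors are considered.
features_per_split = int(math.sqrt(n_features))
candidates = rng.sample(range(n_features), features_per_split)

# Hypothetical class predictions from five trained trees for one file.
tree_votes = [1, 0, 1, 1, 0]
forest_prediction = Counter(tree_votes).most_common(1)[0][0]

print(f"in-bag fraction: {in_bag_fraction:.3f}")
print(f"candidate predictors at this split: {sorted(candidates)}")
print(f"forest prediction: {forest_prediction}")
```

Sampling with replacement leaves roughly a third of the rows unused by each tree, which is where the "two thirds" figure comes from; those unused rows are what the out-of-bag error is measured on.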

• Gradient Boosting

Gradient boosting is a technique used for regression and classification
problems. It produces a model consisting of an ensemble of weak prediction models, which are
typically decision trees.
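
To show how weak models are combined, here is a deliberately small sketch of gradient boosting for squared error, where each weak learner is a one-split decision stump fitted to the current residuals. It is a teaching example on a single made-up feature, not the classifier used in the project; real libraries use deeper trees and, for classification, a log-loss objective. All names and the learning rate of 0.1 are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature with least squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Sequentially add stumps, each fitted to the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Hypothetical feature values with 0/1 targets.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = gradient_boost(xs, ys)

mean_y = sum(ys) / len(ys)
mse_baseline = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(f"baseline MSE: {mse_baseline:.4f}, boosted MSE: {mse_boosted:.6f}")
```

Each stump on its own is a poor predictor, but because every new stump corrects what the previous ones got wrong, the combined error shrinks steadily as estimators are added.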

1.5.3 Performance Measures

There are various methods to evaluate the performance of the algorithms. One of these
is to determine the area under the ROC curve; others are based on the
confusion matrix. When the true values for a dataset are known, the confusion matrix
table is used to evaluate the performance of a classification model.

Table 1

The table shown above is known as the confusion matrix and has four sections. The two
sections in green are the True Positives and True Negatives; these are the observations
which are correctly predicted. The other two sections are in red because these values are
wrongly predicted and thus need to be minimized. These sections are the False Negatives and
False Positives, and they occur when the actual class and the
predicted class disagree.

• True Positives (TP)

These are correctly predicted positive values, i.e. the actual class is
positive and the predicted class is also positive.

• True Negatives (TN)

These are correctly predicted negative values, i.e. the actual class is
negative and the predicted class is also negative.

• False Positives (FP)

These are wrongly predicted positive values, i.e. the actual class is
negative but the predicted class is positive.

• False Negatives (FN)

These are wrongly predicted negative values, i.e. the actual class is
positive but the predicted class is negative.

Fig. 8
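
The four counts defined above, and the measures derived from them, can be computed directly from two lists of labels. The sketch below uses made-up labels and assumes that 1 is the positive class; the function name is hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp, tn, fp, fn


# Hypothetical ground truth and model output for ten files.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are right
recall = tp / (tp + fn)              # share of actual positives that were found

print(tp, tn, fp, fn)
print(accuracy, precision, recall)
```

For this toy data the counts are TP = 3, TN = 4, FP = 2 and FN = 1, giving an accuracy of 0.7, a precision of 0.6 and a recall of 0.75.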

1.6 Organization

This report is organized into the following chapters -

• Chapter 1 gives a basic idea of the project and helps you familiarize yourself with the
necessary technical and theoretical aspects of our project
• Chapter 2 consists of a review of other journals and research papers
• Chapter 3 describes the system design and the various tools and techniques needed to
achieve it
• Chapter 4 gives a comparative study of the performance of each model, making it
easier to choose the most suitable one
• Chapter 5 provides the conclusion and discusses the future scope of
our project

CHAPTER 2

LITERATURE SURVEY

A literature survey consists of analysing other works which have been published
beforehand. It is an essential part of the project, as comparing different
pieces of literature and studying and analysing them helps us draw important
inferences. These experiments have been conducted over time by many different experts and
analysts, and each one of them is unique.

These resources are very useful for budding researchers, as they can study them and
understand the work done in their field up to a given point in time. They also help by
providing us with a specific direction to work towards.

2.1 Related Work

Some of the works in this field have utilized strings or file format properties for feature
representation. The data in PE headers is used for analysing malware specific to the Windows
platform. But this isn't the best way to solve the problem, as the formats of these files can
vary drastically (Hung [2]).

Reddy and Pujari [3] have also done breakthrough work in this field. They utilized n-grams
to attain the desired outcome. Byte n-grams are overlapping substrings
collected in a sliding window fashion. This technique, along with word n-grams
and character n-grams, is very commonly used in NLP.

This approach has its own drawbacks too. One of the biggest difficulties is that the set of
byte strings in the programs is extremely large, so classification techniques cannot be
applied directly (Reddy and Pujari [3]).

In the work done by Kateryna Chumachenko [4], important features were selected and the
model was trained using the hashes of well known malware, and it yielded positive results.
Windows API calls, hashes and other features were used for the analysis, and sandbox testing
was also done, which lent strength to the results obtained from the model.
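
The sliding window idea behind byte n-grams can be shown with a short sketch. This is only an illustration of the general technique and is not taken from the cited works; the function name and the five sample bytes are hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping window of n consecutive bytes."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Hypothetical file fragment: the bytes "MZ", 0x90, then "MZ" again.
counts = byte_ngrams(b"\x4d\x5a\x90\x4d\x5a", n=2)
for gram, count in counts.most_common():
    print(gram.hex(), count)
```

Even for n = 2 there are 65,536 possible byte pairs, and the number grows exponentially with n, which is the feature explosion problem mentioned above.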

Table 2

CHAPTER 3

SYSTEM DEVELOPMENT

The project was completed in the required time in a stepwise manner. The following
timeline gives a better understanding of the stages into which our project was divided and
the time taken to complete each stage of development.

Fig. 9

We have used many tools and technologies for our project, and all of them are discussed here -

3.1 Python

• Python is a high level, general purpose programming language created by
Guido van Rossum and first released in 1991.
• It is widely favoured because of its code readability. It is an object oriented
language with a large number of libraries and modules, making it one of the go-to
languages in various fields.
• It has important libraries like Pandas and NumPy which are extensively used in the
fields of deep learning, machine learning and AI as they help in handling and analysing data.

Fig. 10

3.2 Google Colab

• Google Colab is a hosted notebook environment made by Google which is used by many
data scientists for visualizing as well as preprocessing data.
• It provides a free platform on which various machine learning algorithms can
be implemented.
• Google provides free access to industry grade GPU and CPU resources.

Fig. 11

3.3 Pandas

• Pandas is one of the most sought after tools among data scientists
• This Python library is useful for data analysis and manipulation
• It helps in the manipulation, transformation, cleaning and analysis of data

Fig. 12

3.4 NumPy

• It is a Python library useful for numerical analysis
• The library provides matrix data structures and multidimensional arrays
• It also has functions for linear algebra, Fourier transforms and
matrix operations

Fig. 13

3.5 Scikit-learn

• It is a free ML library for Python and one of the most useful ones
• It contains many tools useful for modelling tasks such as classification, clustering,
regression etc.
• It provides many learning algorithms through a consistent interface.

Fig. 14

3.6 Matplotlib

• It is a Python library used for plotting graphs and other visuals.
• These visuals can be static, animated or even interactive.

Fig. 15

3.7 Dataset

In order to train our model, we selected a dataset curated by the security blogger Prateek
Lalwani. The dataset consists of features extracted from the following sources -

• 41,323 Windows binaries (.exe and .dll files) as legitimate files
• 96,724 malware files downloaded from the VirusShare website

In total, our dataset consists of 138,048 lines. As is visible from the pie chart below, the
ratio of legitimate to malicious files is approximately 3:7.

Fig. 16

3.8 Implementation

Some of the security threats and challenges faced by the digital world have been discussed
in depth in the preceding chapters. Keeping them in mind, we use a static
analysis method to achieve the desired outcome.

3.8.1 Preparing the dataset

• Importing Dataset

Since we are working on Colab, the first step towards processing our data and
implementing the model is to import the dataset.

Fig. 17

• Extracting features

In total, our dataset consists of 56 features.

Fig. 18

Fig. 19

As mentioned above, our dataset consists of 138,048 lines. The first 41,323
lines contain legitimate files and their features, whereas the rest of the dataset
contains malicious files. Legitimate files have the value ‘1’ in the legitimate column,
whereas malicious files have the value ‘0’.

Fig. 20

In the above figure we can clearly see that the legitimate value is 1, which implies that
the data is of a legitimate file.

• Organizing dataset

After importing the data and extracting the features, we organize the dataset into legit
and mal sets.

Fig. 21
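
Since the figures are not reproduced here, the following is a minimal sketch of what the import and organizing steps above could look like with Pandas. It is an assumption, not the project's actual code: the three-row inline CSV, the file names and hash values are invented, and the comma separator is a guess (the real dataset file may use a different delimiter). Only the column names Name, md5, legitimate, DllCharacteristics and SectionsMinEntropy come from this report.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real dataset file (which has 56 features).
csv_text = """Name,md5,DllCharacteristics,SectionsMinEntropy,legitimate
calc.exe,aaa111,33088,5.1,1
notepad.exe,bbb222,33088,4.9,1
dropper.exe,ccc333,0,7.6,0
"""
data = pd.read_csv(io.StringIO(csv_text))

# Organize into legitimate and malicious sets using the 'legitimate' column.
legit = data[data["legitimate"] == 1]
mal = data[data["legitimate"] == 0]

# Identifier columns and the label are not used as input features.
X = data.drop(columns=["Name", "md5", "legitimate"])
y = data["legitimate"]

print(len(legit), len(mal), list(X.columns))
```

With the real file, `pd.read_csv` would be given the path to the uploaded dataset instead of the in-memory text used here.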

• Feature Selection

After organizing and dividing the dataset, we move on to selecting the most
important features. The dataset consists of 56 features, but not all of them are
equally important. So we use tree based feature selection to assign a weight to each
feature and select the most important ones out of the 56.

Fig. 22

We import the necessary libraries and proceed towards selecting the features.

Fig. 23

From the above output image we can see that out of the total 54 features (after removing
the name, md5 and legitimate columns, which are not needed in our scenario) only
14 were important and were selected by the tree based feature selection. We can also see
the features selected and the weight assigned to each of them.

Fig. 24

We get the list of selected features along with their weights, ranging from
‘DllCharacteristics’ with the highest weight of 0.16 to ‘SectionsMinEntropy’ with the lowest
weight among the selected features, i.e. 0.018. We can also visualize this as a
graph.

Graph 1
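
One common way to perform tree based feature selection in scikit-learn is sketched below. The report does not say which estimator was used, so the choice of `ExtraTreesClassifier` together with `SelectFromModel` is an assumption, and the data is synthetic (generated with `make_classification`) rather than the malware dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 20 features, only some of which are informative.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Fit an ensemble of randomized trees; it assigns an importance weight per feature.
forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Keep the features whose importance is at least the mean importance (the default).
selector = SelectFromModel(forest, prefit=True)
X_selected = selector.transform(X)

ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
print("features kept:", X_selected.shape[1], "of", X.shape[1])
for index, weight in ranked[:5]:
    print(f"feature {index}: weight {weight:.3f}")
```

The importance weights sum to 1, so a weight such as the 0.16 reported for ‘DllCharacteristics’ can be read as that feature's share of the total importance.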

• Splitting the dataset

After feature selection we move on to splitting the dataset into training and testing
sets. We can divide the dataset in any ratio, but here we go with an 80:20 ratio, i.e.
80% training size and 20% test size.

Fig. 25
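
The 80:20 split itself is simple enough to sketch in plain Python. The project most likely relied on a library helper such as scikit-learn's `train_test_split`; the function below is a hypothetical stand-in that shows what such a helper does, using 100 dummy rows.

```python
import random


def train_test_split_rows(rows, test_size=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test) parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # hypothetical row identifiers
train, test = train_test_split_rows(rows)
print(len(train), len(test))
```

Shuffling before cutting matters for this dataset in particular, because the legitimate files occupy the first lines of the file and the malicious files the rest; an unshuffled cut would put only malicious files in the test set.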

3.8.2 Learning Algorithms

In this section we discuss the algorithms that are implemented.

• Random Forest

The first algorithm we implement is Random Forest. To
implement this algorithm it is necessary to specify the number of estimators (trees). For our
project we have set the number of estimators to 50.

Fig. 26

• KNN

The next algorithm we implement is KNN. To implement this
algorithm, we first need to choose the value of k, i.e. the number of nearest neighbours that
will be considered when classifying a file. To obtain the optimum value of k,
we use the elbow curve method.

Graph 2

We check the error percentage for each value of k ranging from 1 to 10. As seen in the graph,
the minimum error value is at k = 3. So we implement the KNN algorithm with k = 3.

Fig. 27
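
The elbow curve search described above amounts to computing the test error for each candidate k and picking the smallest. The sketch below does this on synthetic two-cluster data with a small pure-Python KNN; the data, the seed and the helper names are assumptions, so the best k it finds has no bearing on the k = 3 obtained for the malware dataset.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


rng = random.Random(42)


def make_points(center, label, n):
    """Hypothetical 2-D points scattered around (center, center)."""
    return [((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label) for _ in range(n)]


data = make_points(0.0, 0, 60) + make_points(2.0, 1, 60)
rng.shuffle(data)
train, test = data[:96], data[96:]  # 80:20 split

# Error rate on the test set for every k from 1 to 10.
error_rates = {}
for k in range(1, 11):
    wrong = sum(1 for point, label in test if knn_predict(train, point, k) != label)
    error_rates[k] = wrong / len(test)

best_k = min(error_rates, key=error_rates.get)
for k, rate in error_rates.items():
    print(k, round(rate, 3))
print("best k:", best_k)
```

Plotting `error_rates` against k gives the elbow curve; when several values of k tie for the lowest error, this sketch keeps the smallest one.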

• Gradient Boosting

To implement the gradient boosting algorithm, we need to specify the number of
estimators. We set the number of estimators to 50 and implement the gradient boosting
algorithm.

Fig. 28
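
For reference, the three classifiers with the settings stated above (50 estimators for random forest and gradient boosting, k = 3 for KNN) could be trained and compared in scikit-learn roughly as follows. This is a sketch under assumptions: the exact classes and other parameters used in the project are not shown in the text, and the data here is synthetic, so the accuracies printed will not match those reported in Chapter 4.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 14 selected features of the malware dataset.
X, y = make_classification(
    n_samples=600, n_features=14, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    results[name] = (
        accuracy_score(y_test, predicted),
        confusion_matrix(y_test, predicted),
    )
    print(name, round(results[name][0], 4))
```

Because all three models share the same `fit` and `predict` interface, swapping in another algorithm only requires adding one more entry to the `models` dictionary.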

CHAPTER 4

PERFORMANCE ANALYSIS

To compare and analyze the performance of our algorithms we make use of the
confusion matrix and also take into consideration the accuracy of each algorithm.

4.1 Random Forest

Random forest was successfully implemented, so we look at its confusion matrix.

Fig. 29

We get the above confusion matrix for the random forest algorithm. As is visible from it, the
false positives and false negatives are very few compared to the true positives and
true negatives.

Fig. 30

Fig. 31

Based on the above confusion matrix, we get an accuracy of 99.46% for the random
forest algorithm.

4.2 KNN

For the KNN algorithm the confusion matrix is as follows -

Fig. 32

As we can see, the number of false positives and false negatives has increased compared to
the random forest algorithm. On calculating the percentage of false negatives and false
positives we get the following values -

Fig. 33

Based on the above confusion matrix, we get the following accuracy value for the KNN algorithm -

Fig. 34

4.3 Gradient Boosting

To assess the accuracy of the gradient boosting algorithm we first take a look at its
confusion matrix.

Fig. 35

Based on the above confusion matrix, if we calculate the percentage of false negatives and
false positives, their values are still low but higher than those of the other two algorithms.

Fig. 36

Fig. 37

Based on the above confusion matrix, we achieve an accuracy of 98.75%.

CHAPTER 5

CONCLUSIONS

5.1 Conclusions

We plot a graph of false positives and false negatives for better visualization of the results.

Graph 3

Based on the graph, we can conclude that random forest has higher accuracy than the
other two algorithms, since it produces the fewest false positives and false negatives.
After implementing all the algorithms, we find that the highest accuracy, 99.45%, was
achieved by random forest, followed by KNN with 99.06%. Gradient boosting comes last
with an accuracy of 98.75%. The result can be visualized better as a graph.

Graph 4

Hence, the best of the algorithms implemented in our project is random forest, with an
accuracy of 99.45%.
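
The comparison above can be reproduced in a few lines by ranking the accuracies quoted
in this section. The dictionary and variable names are hypothetical; the three values
are the ones reported in the text.

```python
# Accuracies (in percent) as quoted in the conclusions.
reported_accuracy = {
    "Random Forest": 99.45,
    "KNN": 99.06,
    "Gradient Boosting": 98.75,
}

# Sort the algorithms from highest to lowest accuracy.
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)
best_name, best_accuracy = ranking[0]

for name, acc in ranking:
    print(f"{name}: {acc:.2f}%")
```

The same dictionary could be passed to a plotting library to draw the bar chart shown
in Graph 4.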

5.2 Future Scope

Considering the fast-changing pace of technology, a lot can be done to further improve
our proposed model:

• The algorithms implemented are sufficient by themselves, but neural networks now play
an important role in classification problems.

• Neural networks could be implemented in place of the classical machine learning
algorithms used here; they may perform better, particularly where unsupervised learning
is needed.

• More specific feature selection can also be done in order to reduce false positives.

5.3 Applications

• In the field of security, this work has wide and far-reaching implications, as it can
be used to detect malicious files, and many antivirus (AV) companies are working towards
incorporating it in their software.

REFERENCES

1. Firdausi, I., Erwin, A. and Nugroho, A.S., 2010, December. Analysis of machine learning
techniques used in behavior-based malware detection. In 2010 second international
conference on advances in computing, control, and telecommunication technologies
(pp. 201-203). IEEE.

2. Gavriluţ, D., Cimpoeşu, M., Anton, D. and Ciortuz, L., 2009, October. Malware detection
using machine learning. In 2009 International Multiconference on Computer Science and
Information Technology (pp. 735-741). IEEE.

3. Amos, B., Turner, H. and White, J., 2013, July. Applying machine learning classifiers to
dynamic android malware detection at scale. In 2013 9th international wireless
communications and mobile computing conference (IWCMC) (pp. 1666-1671). IEEE.

4. Narudin, F.A., Feizollah, A., Anuar, N.B. and Gani, A., 2016. Evaluation of machine
learning classifiers for mobile malware detection. Soft Computing, 20(1), pp.343-357.

5. Ahmed, F., Hameed, H., Shafiq, M.Z. and Farooq, M., 2009, November. Using
spatio-temporal information in API calls with machine learning algorithms for malware
detection. In Proceedings of the 2nd ACM workshop on Security and artificial intelligence
(pp. 55-62).

6. Santos, I., Devesa, J., Brezo, F., Nieves, J. and Bringas, P.G., 2013. Opem: A
static-dynamic approach for machine-learning-based malware detection. In International
Joint Conference CISIS’12-ICEUTE’12-SOCO’12 Special Sessions (pp. 271-280). Springer,
Berlin, Heidelberg.

7. Ham, H.S. and Choi, M.J., 2013, October. Analysis of malware detection performance
using machine learning classifiers. In 2013 international conference on ICT Convergence
(ICTC) (pp. 490-495). IEEE.

8. Reddy, Krishna Sandeep, and Arun Pujari. 2006. N-gram analysis for computer virus
detection.

9. Hung, Pham Van. 2011. An approach to fast malware classification with machine learning
techniques.

Dewesha_Malware_Detection.docx
by Dewesha Sharma (171361)

Submission date: 15-May-2021 11:50PM (UTC+0530)
Submission ID: 1586749405
File name: Dewesha_Malware_Detection.docx (2.4M)
Word count: 5709
Character count: 28491
Dewesha_Malware_Detection.docx
ORIGINALITY REPORT

Similarity index: 15%
Internet sources: 8%
Publications: 6%
Student papers: 11%

PRIMARY SOURCES

1. Submitted to Jaypee University of Information Technology (Student Paper): 9%
2. www.theseus.fi (Internet Source): 2%
3. "Information and Communications Security", Springer Science and Business Media LLC,
2020 (Publication): <1%
4. docplayer.net (Internet Source): <1%
5. Submitted to University of Bradford (Student Paper): <1%
6. Submitted to Newman College (Student Paper): <1%
7. Submitted to The New Art College (Student Paper): <1%
8. Submitted to Coventry University (Student Paper): <1%
9. Vaibhav Verdhan. "Supervised Learning with Python", Springer Science and Business
Media LLC, 2020 (Publication): <1%
10. pt.scribd.com (Internet Source): <1%
11. link.springer.com (Internet Source): <1%
12. www.ijcrd.com (Internet Source): <1%
13. www.scititles.com (Internet Source): <1%

Exclude quotes: On
Exclude matches: < 10 words
Exclude bibliography: On
JAYPEE UNIVERSITY OF INFORMATION TECHNOLOGY, WAKNAGHAT
PLAGIARISM VERIFICATION REPORT
Date: 16 JUNE 2021
Type of Document (Tick): B.Tech Project Report
Name: Dewesha Sharma
Department: Computer Science Engineering and Information Technology
Enrolment No.: 171361
Contact No.: 9816280648
E-mail: [email protected]
Name of the Supervisor: Dr. Himanshu Jindal
Title of the Thesis/Dissertation/Project Report/Paper (In Capital letters):
MALWARE DETECTION USING MACHINE LEARNING

UNDERTAKING
I undertake that I am aware of the plagiarism-related norms/regulations. If I am found
guilty of any plagiarism or copyright violations in the above thesis/report, even after
the award of the degree, the University reserves the right to withdraw/revoke my
degree/report. Kindly allow me to avail the plagiarism verification report for the
document mentioned above.
Complete Thesis/Report Pages Detail:
• Total No. of Pages = 53
• Total No. of Preliminary pages = 11
• Total No. of pages accommodate bibliography/references = 2
(Signature of Student)
FOR DEPARTMENT USE
We have checked the thesis/report as per norms and found Similarity Index at (%).
Therefore, we are forwarding the complete thesis/report for final plagiarism check. The
plagiarism verification report may be handed over to the candidate.

(Signature of Guide/Supervisor) Signature of HOD


FOR LRC USE
The above document was scanned for plagiarism check. The outcome of the same is
reported below:
Copy Received on:
Excluded: All Preliminary Pages; Bibliography/Images/Quotes; 14 Words String
Similarity Index (%): 15
Report Generated on:
Submission ID:
Generated Plagiarism Report Details (Title, Abstract & Chapters): Word Counts; Character
Counts; Total Pages Scanned; File Size

Checked by
Name & Signature    Librarian

Please send your complete thesis/report in (PDF) with Title Page, Abstract and Chapters in (Word File) through
the supervisor at [email protected]
