2021 International Conference on Technological Advancements and Innovations (ICTAI)

Malware Detection Using Machine Learning


Prabhat Singh, Assistant Professor, Department of Computer Science & Engineering, ABES Engineering College, Dr. APJ Abdul Kalam University, Lucknow, India, [email protected]

Sakshi Kaur, Final Year Student, Department of Computer Science & Engineering, ABES Engineering College, Dr. APJ Abdul Kalam University, Lucknow, India, [email protected]

Shivani Sharma, Final Year Student, Department of Computer Science & Engineering, ABES Engineering College, Dr. APJ Abdul Kalam University, Lucknow, India, [email protected]

Gitika Sharma, Final Year Student, Department of Computer Science & Engineering, ABES Engineering College, Dr. APJ Abdul Kalam University, Lucknow, India, [email protected]

Swati Vashisht, Amity University, Greater Noida Campus, Uttar Pradesh, India, [email protected]

Vinay Kumar, Tathagat Gautam Buddh Government Polytechnic, Sirsiya, Shravasti, [email protected]

978-1-6654-2087-7/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICTAI53825.2021.9673465

Abstract - Research to date indicates that, over the last decade, malware has grown exponentially and has caused significant financial losses to organizations. It is therefore important to detect whether a file contains malware. Malware can cause considerable damage to a system, for example by slowing it down or by stealing sensitive information from it. Today, one of people's most important assets is their data and information, which needs to be protected. Hence there is a need for software that can perform this task and help ensure the integrity of our systems. Our method for malware detection uses different machine learning algorithms such as decision tree and random forest. The algorithm with the maximum accuracy is selected, which gives the system a high detection ratio. Furthermore, the performance of the system is measured by calculating the false positive and false negative rates using the confusion matrix.

Keywords - Machine Learning, Malware detection, Cyber Security, Malware, Confusion matrix, Feature selection, Feature transformation, Legitimate, Gradient boosting, Random forest, Adaboost, Decision Tree

I. INTRODUCTION

In today's world, one of people's most important assets is their data and information. Because they are this important, they must be kept secure and protected at any cost. Malware is software intended to cause damage to a computer system, server or network. It can be installed in many ways, such as through phishing emails, infected attachments and infected links.

So, in order to keep our systems safe and protected, we must remove all files containing malware, which makes malware detection an urgent need. Malware detection can thus help protect the sensitive information held on people's systems and also enhance the integrity of those systems.

Antivirus systems are known to detect malicious content in a system, but malware detection using machine learning can strengthen them further. The standard detection systems of antivirus vendors have an accuracy of about 90%, but if our approach is used, this accuracy can be improved by about 3 or 4%.

So, the main objective of our project is to scan a given file and detect whether it contains any kind of malicious content. The project therefore focuses primarily on detecting malicious content in the files provided.

II. LITERATURE REVIEW

According to Sanjay Sharma, C. Rama Krishna and Sanjay K. Sahay [1], many anti-malware tools are in use today for malware detection, but they are based on signatures, which are ineffective against advanced unknown malware such as metamorphic malware.

Opcode frequency has been studied as a means of detecting malware with machine learning algorithms and techniques. Five classifiers that achieved the highest accuracy were selected for in-depth analysis: LMT, NBT, J48 Graft, Random Forest and Random Tree.

According to [3], malware is code written by an attacker with the intention of causing harm to users' systems. Malware has different variants, such as backdoor, virus, rootkit, ransomware, worm and adware.

Nearly 350,000 new types of malicious code harm applications. The authors aim to survey the previously published papers and the existing work in the field of malware detection using machine learning.

The D parameter is responsible for controlling how strictly the system classifies a file as benign or malware. The value of N was varied over 2, 4, 6 and 8. The best detection ratio achieved was 74.37%, with N = 4, K = 17 and D = 17.

According to [7], first-generation scanners use fundamental approaches to detect viruses. These methods involve scanning for given sequences of bytes, known as strings. Scanners also support wildcards, which allow bytes or byte ranges to be skipped. Simple string-matching detection methods stopped being effective when virus mutator kits appeared, as these kits made a virus appear very different from its true form.

Pramod Subramanyan, Zhixing Xu, Sayak Ray and Sharad Malik [2] proposed a different approach in which one model per application separates legitimate executions from executions infected with malware. The algorithms used are logistic regression, SVM (support vector machine) and random forest. The histogram bin size needs to be chosen carefully.

In [6], approaches from machine learning and data mining, chiefly text classification, are used. N-grams are extracted from the executables, and each is encoded as a Boolean attribute.
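The Boolean N-gram attributes described for [6] can be illustrated with a minimal sketch. It is not the implementation used in any of the cited papers: the function names `byte_ngrams` and `boolean_features` and the sample byte strings are hypothetical, and the N-grams are assumed to be taken over raw bytes.

```python
def byte_ngrams(data: bytes, n: int) -> set:
    """Return the set of distinct n-byte sequences occurring in `data`."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def boolean_features(samples, n, vocabulary=None):
    """Encode each sample as a Boolean vector: True if the N-gram occurs in it.

    `vocabulary` is the ordered list of N-grams used as attributes; when it is
    omitted, every N-gram seen in `samples` is used (an assumption made here
    for illustration, not a step taken from the cited work).
    """
    grams = [byte_ngrams(sample, n) for sample in samples]
    if vocabulary is None:
        vocabulary = sorted(set().union(*grams))
    vectors = [[gram in sample_grams for gram in vocabulary]
               for sample_grams in grams]
    return vocabulary, vectors


if __name__ == "__main__":
    # Two made-up byte strings standing in for executables.
    vocab, vectors = boolean_features([b"abc", b"bcd"], n=2)
    for gram, column in zip(vocab, zip(*vectors)):
        print(gram, list(column))
```

Each resulting vector can then be passed to any classifier that accepts binary attributes; choosing N trades vocabulary size against how specific each attribute is.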

TABLE I. COMPARATIVE ANALYSIS OF RESEARCH PAPERS

| S.No | Year | Title | Algorithms used | Gaps in work |
|------|------|-------|-----------------|--------------|
| 1 | 2019 | Detection of advanced malware by Machine Learning techniques | Decision Tree, Random Forest, Naive Bayes, J48 Graft | Use of signature-based method which is traditional and does not provide best accuracy. |
| 2 | 2017 | Malware Detection using Machine Learning Based Analysis of Virtual Memory Access Patterns | SVM, Random forest | Human input is a necessity which limits automation. Size of histogram is to be chosen carefully. |
| 3 | 2020 | Classification Of Malware Detection using Machine Learning Algorithms | Naive Bayes, support vector machine, random forest, K-nearest neighbor | Only focuses on the use of machine learning algorithms for malware detection. |
| 4 | 2009 | N-Grams based file signatures for malware detection | KNN algorithm | Good detection ratio is only achieved for higher values of N. |
| 5 | 2008 | Learning and Classification of Malware Behavior | Support Vector Machine | Relies on single program execution of a malware binary. |
| 6 | 2006 | Learning to Detect and Classify Malicious Executable in the Wild | Naive Bayes, Support vector machine | The relative performance of methods used in this paper was not as good as the previous one. |
| 7 | 2008 | Metamorphic Virus: Analysis and Detection | Random decryption algorithm (RDA) | Some viruses cannot be detected even in an emulated environment. |
| 8 | 2006 | Machine Learning for Computer Security | Adaptive statistical compression algorithms | The adversary can defeat the computer that learns how to extract signatures for detecting computer worms. |
| 9 | 2019 | Malware Detection using Machine Learning and Deep Learning | Random forest and KNN Algorithm | Does not apply any recurrent neural networks for malware detection. |
| 10 | 2017 | Malware detection using Machine Learning Algorithms | SVM, Decision Tree, Naive Bayes and Multi-Naive Bayes Algorithm | Use of signature-based method which is traditional. |
| 11 | 2017 | Malware Detection and Evasion with Machine Learning Techniques: A Survey | Heuristic, Artificial Intelligence, Behavior, Signature Based Methods | Use of traditional methods for malware detection. |
| 12 | 2012 | Malware Detection Module using Machine Learning Algorithms to Assist in Centralized Security in Enterprise Networks | Decision Tree, Random Forest, Naive Bayes | Some methods of machine learning are not appropriate due to heavy processors. |

III. PROPOSED SYSTEM

Our system is divided into three major modules: the user interface, the train module and the malware test module.

The user interface module is the front-end module; it contains the front-end architecture of the

system. It provides an interface through which the user submits the file to be checked for malicious content.

The next module is the train module. It is used to train as well as test the selected models. The model to be used is selected according to the accuracy of each candidate.

This module is the main module and is responsible for the final classification result. The classifier for the model is also generated in this module.

The third module is the malware test module. It extracts the data from the file that the user has uploaded through the user interface.

It is responsible for extracting and determining the data from the uploaded file and for dividing that data into various sections or features.

The architecture consists mainly of three modules:

1. Feature Database
2. Feature Selection and Transformation
3. Learning the Algorithms

First, we discuss the dataset. For this project we used the Kaggle Microsoft malware classification challenge dataset, which is a CSV (comma-separated values) file.

In the next step, features are selected using methods such as chi-square, information gain, Fisher score, gain ratio and symmetric uncertainty.

After feature selection and transformation, the dataset is split into two datasets: a testing dataset and a training dataset. Various algorithms are then used to detect unknown malware in the file.

The final step of the architecture is classification of the results. In the proposed approach, Random Forest, Decision Tree, Linear Regression and Adaboost detect malware with high accuracy and improve efficiency.

IV. IMPLEMENTATION

The project has been implemented using machine learning technologies. The programming language used for the implementation is Python 3. The back-end technology is machine learning, and the front-end technology is the tkinter GUI.

The implementation involves three major modules: two back-end modules and one front-end module.

The back-end modules are malware test and train. The front-end module is the user interface module.

The implementation can be explained as a series of steps, which are summarized in a flowchart to aid understanding.

Firstly, the data set is collected. This can be done by searching the web and using websites such as Kaggle. After the data is collected, feature selection and transformation are carried out on the data set.

After feature selection and transformation, an important process known as feature importance is carried out.

Feature importance is the process used to identify the features that are the most important, that is, the features which have the most impact on the database or the system. After this the data set is split into two sets:

Training dataset (80% of the dataset):

• This portion of the dataset is used for training. The model learns from this dataset.

Testing dataset (20% of the dataset):

• This portion of the dataset is used for testing. The model is tested on this dataset, and its accuracy is determined from it.

The split is therefore 80% for the training data set and 20% for the testing data set, so the test size is kept as 0.2.

According to our dataset the algorithm with the maximum accuracy was Decision Tree.

Hence, it was selected to be used in the system. That model is then trained using the dataset.

Then two files were generated. They were:

• classifier.pkl
• features.pkl

After the classifier is ready we select the testing sample. The testing sample is selected from the testing data set.

The features of the sample are then tested with the help of the classifier.

If the file is malicious, the output is displayed as malicious; otherwise the output is displayed as legitimate.

V. CONCLUSION AND FUTURE WORK

In order to keep our systems safe we need to ensure that there are no files which contain any kind of malware. We implemented our system in order to detect such malware. While implementing the train module we applied multiple algorithms to a dataset in order to achieve the best possible accuracy for our system, and we selected the model with the best accuracy for the dataset.

According to the results shown in the table, random forest provides the best accuracy on the dataset. Hence the classifier for random forest is generated.

This shows that the accuracy achieved by our system is about 99 percent, which is a good accuracy for detecting malware. In summary, the results produced by our system are an accuracy of 99%, a false positive rate of 0.104% and a false negative rate of 0.154%.

A classification approach can additionally be implemented for the malware detection system presented. It would involve correctly identifying the type of malware that has attacked the file, and it could serve as a basis for further research into identifying the most commonly attacking malware. This presents an idea for future work on the project.

REFERENCES
[1] Sanjay K. Sahay, C. Rama Krishna, Sanjay Sharma, "Detection of Advanced Malware by Machine Learning Techniques", 2019
[2] Pramod Subramanyan, Zhixing Xu, Sayak Ray, Sharad Malik, "Malware Detection using Machine Learning Based Analysis of Virtual Memory Access Patterns", 2017
[3] R Mohanasundaram, P Harsha Latha, "Classification of Malware Detection using Machine Learning Algorithms", 2020
[4] Y. K. Penya, Santos, J. Devesa, P. G. Garcia, "N-Grams based file signatures for malware detection", 2009
[5] Thorsten Holz, Konrad Rieck, Carsten Willems, Patrick Düssel, Pavel Laskov, "Learning and Classification of Malware Behavior", 2008
[6] J. Zico Kolter, Marcus A. Maloof, "Learning to Detect and Classify Malicious Executable in the Wild", 2006
[7] Evgenios Konstantinou, "Metamorphic Virus: Analysis and Detection", 2008
[8] Philip K. Chan, Richard P. Lippmann, "Machine Learning for Computer Security", 2006
[9] Hemant Rathore, Swati Agarwal, Sanjay K. Sahay and Mohit Sewak, "Malware Detection using Machine Learning and Deep Learning", 2019
[10] Mohd Tanveer Shaikh, Rafia Ansari, Mahenoor Suriya, Sonalii Suryawanshi, "Malware detection using Machine Learning Algorithms", Mohammad Danish Khan, 2017
[11] Jhonattan J. Barriga A. and Sang Guun Yoo, "Malware Detection and Evasion with Machine Learning Techniques: A Survey", 2017
[12] Priyank Singhal, Nataasha Raul, "Malware Detection Module using Machine Learning Algorithms to Assist in Centralized Security in Enterprise Networks", 2012
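The evaluation workflow described in Sections IV and V (holding out 20% of the data, comparing candidate models on accuracy, and reporting false positive and false negative rates from the confusion matrix) can be sketched as follows. This is a minimal illustration and not the authors' code. The function names are hypothetical, the candidates are plain prediction functions standing in for trained classifiers, and the rates use the standard definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP), which the paper does not state explicitly.

```python
import random


def split_dataset(rows, test_size=0.2, seed=0):
    """Shuffle `rows` and hold out a `test_size` fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, testing set)


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn); labels are assumed to be 1 = malicious, 0 = legitimate."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
    return tp, tn, fp, fn


def rates(tp, tn, fp, fn):
    """Accuracy, false positive rate and false negative rate from the counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }


def select_best(candidates, test_rows):
    """Score each candidate on (features, label) rows; return the most accurate one."""
    y_true = [label for _, label in test_rows]
    scores = {}
    for name, predict in candidates.items():
        y_pred = [predict(features) for features, _ in test_rows]
        scores[name] = rates(*confusion_counts(y_true, y_pred))
    best = max(scores, key=lambda name: scores[name]["accuracy"])
    return best, scores
```

In a real pipeline the candidates would be the trained Decision Tree, Random Forest and Adaboost models, and the selected one would then be serialized, for example to `classifier.pkl`.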
