Malware Detection Using ML

Malware Detection

Uploaded by

Hasibul Hasan

Malware Detection Using Machine Learning

Table of Contents
1 Summary
2 Background
3 Problem Statement
4 Motivation and Significance
5 Methodology
5.1 System Requirement
5.1.1 Software Tools
5.2 Modules
5.3 System Diagram
5.4 Algorithms Used
5.4.1 Decision Trees
5.4.2 Random Forest
5.4.3 Logistic Regression
5.4.4 Support Vector Machine (SVM)
5.5 Workflow
5.5.1 Data Collection
5.5.2 Data Type
5.6 Model Training
5.7 Model Evaluation
5.7.1 Model Accuracy
5.7.2 Confusion Matrix and Classification Report of Different Models
6 Results
7 Conclusion
8 Limitations
9 Future Enhancements
References

1 Summary
This study harnesses supervised machine learning to enhance malware detection through dynamic
analysis of system activity logs. By executing both malicious and benign software within a secure
sandbox environment, specifically the Flare VM on Windows 10, extensive log files are generated
to capture unique behaviors exhibited by each type of software. Dynamic malware analysis offers a
comprehensive approach by observing malware's real-time actions, capturing system modifications,
registry changes, network requests, and other activities that static analysis may miss.
The process begins with log file generation from the controlled execution of both malware and
goodware in the sandbox. These log files undergo data preprocessing using the bag-of-words NLP
technique, a method that converts textual data into structured numerical features based on word
frequency, enabling feature extraction that reflects the behavior patterns of both benign and malicious
software. Each data point is then labeled to facilitate training within a supervised learning framework.
Subsequent model training is performed using four distinct machine learning algorithms—Random
Forest, Decision Tree, Logistic Regression, and Support Vector Machine (SVM). Each of these
algorithms contributes a unique approach to pattern recognition within the labeled data, with Random
Forest emerging as the highest-performing model. It achieves an impressive accuracy of 99.99806%,
highlighting its capability to discern nuanced behavioral differences with minimal error. This high
accuracy suggests that Random Forest is particularly well-suited for this application due to its ability
to handle large, complex datasets and to mitigate overfitting.

2 Background

In today’s increasingly digitized landscape, cybersecurity is critical for protecting sensitive data,
intellectual property, and private information from unauthorized access, theft, and manipulation.
Malware, a term derived from "malicious software," refers to software designed with the intent to
disrupt, damage, or gain unauthorized access to devices, networks, and data systems. Common types
of malware include viruses, worms, Trojan horses, spyware, adware, and ransomware, each posing
different threats and attack vectors. To analyze malware, two primary methods are used: static and
dynamic analysis. Static analysis involves examining the code without execution, making it quicker
but often less effective at revealing runtime behaviors. In contrast, dynamic analysis, the approach used in this
work, focuses on executing malware within a secure, isolated environment to capture its real-time
behavior. By running the malware in a sandbox environment, like Flare VM, researchers can observe
changes in the system, including file modifications, registry alterations, and network requests, and
log these activities for analysis. These log files serve as a foundation for data extraction and are
essential for training machine learning models to improve detection accuracy.

3 Problem Statement

With the rapid growth in both the volume and sophistication of cyber attacks, traditional
programming methods struggle to identify and mitigate malware efficiently. As malware becomes
more complex and evasive, these conventional approaches lack the flexibility to adapt quickly.
Machine learning (ML) offers a promising solution, leveraging large datasets to identify patterns,
classify anomalies, and generalize to new forms of malware. This study aims to apply ML techniques
to malware detection using log data from dynamic analysis, improving the detection process's
accuracy, adaptability, and efficiency.

4 Motivation and Significance

Cybersecurity breaches can cause severe financial and reputational damage to organizations,
impacting operations, customer trust, and compliance with regulatory standards. Given the critical
role cybersecurity plays in national and economic security, effective defenses are essential. Machine
learning is particularly well-suited for tackling the vast and complex nature of modern malware
threats. By analyzing extensive log files, machine learning models can overcome the scalability
limitations of traditional methods, making it a powerful tool in the fight against evolving cyber
threats. This research is motivated by the need to bolster cybersecurity defenses using advanced
machine learning techniques to support real-time, adaptable malware detection that can improve
overall system security in the face of ever-changing threats.

5 Methodology
5.1 System Requirement

5.1.1 Software Tools

Malware Analysis Lab:

A well-equipped malware analysis lab typically uses Kali Linux as the host operating system
for its cybersecurity tooling, with Oracle VirtualBox providing isolated virtual
machines. Inside a virtual machine, Windows 10 is installed and configured with Flare VM,
a platform tailored for malware analysis. ProcMon (Process Monitor) is used to record file system,
registry, and process activity. The lab uses malware samples sourced from MalwareBazaar and GitHub,
complemented by goodware samples for comparison. This setup enables safe and effective analysis
of malware behavior in a controlled environment.

1) Kali Linux host (preferable)

2) Oracle VirtualBox

3) Windows 10 (guest)

4) Flare VM

5) ProcMon

6) Malware samples from MalwareBazaar and GitHub

7) Goodware samples

5.2 Modules

1) Pandas
2) NumPy
3) Seaborn
4) Matplotlib
5) Scikit-learn (sklearn)

5.3 System Diagram

Fig 1: System for Log File Generation

5.4 Algorithms Used

5.4.1 Decision Trees

A Decision Tree is a flowchart-like structure used for decision-making and classification,
where each internal node represents a test or decision based on a particular attribute. Each
branch of the tree corresponds to a possible outcome of the test, guiding the flow down to
subsequent nodes. Finally, each leaf node (or terminal node) represents a class label or
predicted outcome. Decision Trees are intuitive and interpretable, and they often yield
strong results by capturing patterns and relationships within data. However, they can be
prone to overfitting, especially with complex datasets, making them less reliable for
generalization without further adjustments, such as pruning.
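To make the idea of a node "test" concrete, the following is a minimal sketch of how a single threshold test on one numeric feature could be chosen by minimizing Gini impurity. The report does not state which split criterion was used, so the criterion, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' test."""
    best = (None, float("inf"))
    # The largest value is skipped because it would send every row to the left.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: how often one token appears in each log,
# with labels 1 (goodware) and -1 (malware).
token_counts = [0, 1, 1, 5, 7, 9]
labels = [1, 1, 1, -1, -1, -1]

threshold, impurity = best_threshold(token_counts, labels)
print(threshold, impurity)
```

On this toy data the test "count <= 1" separates the two classes completely, so both child nodes are pure. A full tree repeats this search over every feature at every node, which is also why an unpruned tree can fit noise in the training set.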

5.4.2 Random Forest

Random Forest is a supervised learning algorithm that creates a "forest" composed of
multiple decision trees to improve predictive accuracy and control overfitting. Each
tree in the forest is trained on a bootstrap sample of the data (rows drawn at random
with replacement), and the final output is derived by aggregating the predictions of
all individual trees. This ensemble approach allows Random Forest to produce reliable
results even without extensive hyperparameter tuning, making it a straightforward yet
powerful tool in machine learning. By combining the results of numerous decision trees,
Random Forest typically achieves better accuracy than a single decision tree,
because the errors of individual trees tend to cancel out when their votes are combined.

Fig 2: Random Forest Diagram
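The two ingredients described above, training each tree on a resampled subset and aggregating the per-tree predictions, can be sketched without any library. This is an illustrative sketch only: the helper names and the example votes are hypothetical, and library implementations may aggregate differently (for example by averaging class probabilities rather than counting hard votes).

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Aggregate per-tree class predictions into a single label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-ins for training rows
sample = bootstrap_sample(rows, rng)

# Hypothetical votes from five trees for one log sample:
# 1 = goodware, -1 = malware.
tree_votes = [-1, -1, 1, -1, 1]
print(len(sample), majority_vote(tree_votes))
```

Because sampling is done with replacement, some rows appear several times in a tree's sample and others not at all, which is what makes the trees differ from one another.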

Advantages
1. Reduces Overfitting: By averaging or combining the predictions of multiple decision
trees, random forests reduce the risk of overfitting, leading to a more robust model.
2. Handles Large Data Ranges: Random forests perform better with a wide variety of data
points compared to a single decision tree, making them suitable for complex datasets.
3. Reduced Variance: Compared to individual decision trees, random forests exhibit lower
variance, resulting in more stable and reliable predictions.
4. High Flexibility and Accuracy: Random forests are highly flexible and capable of adapting
to various data structures, achieving consistently high accuracy levels.
5. No Need for Data Scaling: Unlike many algorithms, random forests do not require data
scaling, simplifying the preprocessing stage.
6. Tolerates Missing Data: Random forests can remain accurate on incomplete datasets,
provided the missing values are handled appropriately (for example, by imputation).

Disadvantages

1. High Storage Requirements: Random forest models can be memory-intensive due to the
large number of decision trees they generate, requiring substantial storage space.
2. Time and Computation Intensive: Building and evaluating random forest models can be
computationally expensive and time-consuming, especially with large datasets and many
trees.

5.4.3 Logistic Regression

Logistic regression is a supervised classification algorithm primarily used for binary
classification problems. In this context, the target variable (or output), denoted y,
can assume only discrete values for a given set of features (or inputs), denoted
X. This makes logistic regression well-suited for scenarios where the
goal is to classify data points into one of two categories, such as distinguishing between
malware and goodware. The algorithm works by estimating the probability that a given
input belongs to a particular class using the logistic function, which outputs values
between 0 and 1. By applying a threshold (typically 0.5), logistic regression
converts these probabilities into class labels, making it a valuable tool for
tasks requiring clear, interpretable classification outcomes.
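The probability-then-threshold step can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the weights and bias are invented for illustration rather than learned from the report's data, and treating goodware (label 1) as the class whose probability is modeled is an assumption.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 (goodware) or -1 (malware) from a linear score and a threshold."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability_goodware = sigmoid(z)
    return 1 if probability_goodware >= threshold else -1


# Hypothetical weights for three token-count features.
weights = [0.8, -1.5, 0.3]
bias = 0.1

print(predict_label([2, 0, 1], weights, bias))  # linear score z = 2.0
print(predict_label([0, 3, 0], weights, bias))  # linear score z = -4.4
```

Raising or lowering the threshold trades false positives against false negatives, which matters in malware detection because the two kinds of error usually do not cost the same.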

5.4.4 Support Vector Machine (SVM)

Support Vector Machine (SVM) is a powerful supervised learning algorithm used for classification tasks.
It works by constructing a hyperplane (or a line in two-dimensional space) that optimally separates
different classes of data points. The goal of SVM is to find the hyperplane that maximizes the margin
between the closest points of the classes, known as support vectors. However, SVM is less effective in
situations where classes overlap significantly or when there is a high level of noise in the data, as this can
lead to poor generalization. Despite these limitations, SVM is particularly suitable for datasets with a
large number of features, making it a good choice for complex problems such as malware detection,
where distinguishing between malicious and benign samples may involve many different characteristics.
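For the linear case, the separating hyperplane and its margin can be written out directly. The sketch below assumes a linear SVM with a weight vector w and offset b that have already been learned. The values used are hypothetical, and the report does not say which kernel was used.

```python
import math


def decision_value(x, w, b):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x, w, b):
    """Return 1 (goodware) or -1 (malware) for a linear SVM with parameters w, b."""
    return 1 if decision_value(x, w, b) >= 0 else -1


def margin_width(w):
    """Distance between the margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane in a two-feature space.
w, b = [3.0, 4.0], -5.0

print(classify([2.0, 1.0], w, b))  # score 3*2 + 4*1 - 5 = 5
print(classify([0.0, 0.5], w, b))  # score 0 + 2 - 5 = -3
print(margin_width(w))             # 2 / ||w|| with ||w|| = 5
```

Training an SVM amounts to choosing w and b so that this margin is as wide as possible while keeping the training points on the correct side, with some tolerance for violations when the classes overlap.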

5.5 Workflow

5.5.1 Data Collection

Log file generation

o Set up a malware analysis lab and take a snapshot.

o Collect malware and goodware samples.

o Run the malware samples and save the log files as malware1.csv.

o Restore the lab to the previous snapshot.

o Run the goodware samples and save the log files as good1.csv.

Data extraction from log files

Data extraction from log files is performed using the Natural Language Processing (NLP) technique
known as Bag of Words. This approach involves transforming the textual data captured in the log
files into a structured format suitable for machine learning analysis. Specifically, we employ the
CountVectorizer from the Scikit-learn library to facilitate this extraction process.
The CountVectorizer works by converting a collection of text documents into a matrix of token
counts, effectively capturing the frequency of each word or feature present in the logs. This
transformation allows us to create a numerical representation of the data, which is essential for
feeding it into machine learning models.
In addition to data extraction, the labeling of the dataset occurs concurrently. Each entry is labeled
as either 1 for goodware (benign software) or -1 for malware, based on the nature of the
corresponding log entry. This dual process of data extraction and labeling ensures that the dataset is
well-prepared for subsequent analysis, enabling accurate model training and evaluation in the context
of malware detection.
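The bag-of-words transformation and the concurrent labeling can be sketched in plain Python to show what a token-count matrix looks like. This is a dependency-free illustration of the idea, not the CountVectorizer implementation itself, whose tokenizer and options differ. The log lines below are hypothetical stand-ins for real ProcMon exports.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(document, vocabulary):
    """Return the token-count row for one document."""
    counts = Counter(document.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical log lines; real log exports carry many more fields.
malware_logs = ["RegSetValue WriteFile WriteFile", "TCPConnect RegSetValue"]
goodware_logs = ["ReadFile CloseFile", "ReadFile ReadFile CloseFile"]

documents = malware_logs + goodware_logs
# Labels follow the report's convention: -1 for malware, 1 for goodware.
labels = [-1] * len(malware_logs) + [1] * len(goodware_logs)

vocabulary = build_vocabulary(documents)
matrix = [vectorize(doc, vocabulary) for doc in documents]

print(list(vocabulary))
for row, label in zip(matrix, labels):
    print(row, label)
```

Each row is one log entry, each column is one token, and the label is kept alongside the row. Word order is discarded, which keeps the features simple but means that sequences of operations are not captured.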

5.5.2 Data Type

The final dataset utilized in this study is in CSV format, consisting of 34,371 rows and 501
columns. Each entry in the dataset is labeled to indicate its classification: a label of 1 denotes
goodware (benign software), while a label of -1 indicates malware. This structured format allows
for efficient processing and analysis, enabling the application of various machine learning
algorithms for malware detection and classification. The comprehensive nature of the dataset, with
a significant number of features and instances, enhances the potential for accurate predictions and
effective model training.

5.6 Model Training

In this phase, four distinct machine learning models are trained using various algorithms: Decision
Tree, Random Forest, Logistic Regression, and Support Vector Machine (SVM). Each model is
developed to analyze the extracted features from the malware and goodware log files, aiming to
classify the software accurately.

1. Decision Tree: This model creates a flowchart-like structure to make predictions based on
feature values, allowing for easy interpretation and decision-making.
2. Random Forest: As an ensemble method, Random Forest combines multiple decision trees
to improve accuracy and robustness, effectively reducing the risk of overfitting.
3. Logistic Regression: Used for binary classification, this model estimates the probability of
a software sample being malware or benign based on input features, providing clear output
in the form of class labels.
4. Support Vector Machine (SVM): This model finds the optimal hyperplane that separates
the data into distinct classes, making it suitable for datasets with a large number of features.

Each of these algorithms contributes unique strengths to the training process, allowing for a
comprehensive evaluation of their performance in detecting malware based on the available dataset.
The effectiveness of each model will be assessed through various metrics, enabling the selection of
the most accurate approach for malware classification.

5.7 Model Evaluation

5.7.1 Model Accuracy

Model                  Accuracy
Decision Tree          99.83514%
Random Forest          99.99806%
Logistic Regression    99.9806%
SVM                    99.71877%

Table 1: Models and Their Accuracy
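Because all four accuracies are close to 100%, the differences are easier to compare as error rates. The short sketch below converts each accuracy from Table 1 into expected misclassifications per 100,000 samples. The per-100,000 scale is chosen only for readability and is not the size of the report's test set, which is not stated.

```python
def errors_per_100k(accuracy_percent):
    """Convert an accuracy percentage into misclassifications per 100,000 samples."""
    return round((100.0 - accuracy_percent) * 1000.0, 2)


# Accuracy values copied from Table 1.
accuracies = {
    "Decision Tree": 99.83514,
    "Random Forest": 99.99806,
    "Logistic Regression": 99.9806,
    "SVM": 99.71877,
}

# Rank the models from most to least accurate.
ranking = sorted(accuracies, key=accuracies.get, reverse=True)

for model in ranking:
    print(f"{model}: {errors_per_100k(accuracies[model])} errors per 100,000 samples")
```

Read this way, the reported figures order the models as Random Forest, Logistic Regression, Decision Tree, and SVM, and the SVM error rate is more than a hundred times that of Random Forest even though both accuracies round to "above 99%".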

5.7.2 Confusion Matrix and Classification Report of Different Models

5.7.2.1 Decision Tree

Fig 3: Confusion Matrix and Classification Report of Decision Tree Model

5.7.2.2 Random Forest

Fig 4: Confusion Matrix and Classification Report of Random Forest Model

5.7.2.3 Logistic Regression

Fig 5: Confusion Matrix and Classification Report of Logistic Regression Model

5.7.2.4 SVM

Fig 6: Confusion Matrix and Classification Report of SVM Model
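For readers without access to the figures, the following sketch shows how the quantities in a classification report follow from the four cells of a binary confusion matrix. The counts are hypothetical and are not taken from Figs 3 to 6. Treating malware (label -1) as the positive class is also an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, NOT the report's results.
# tp: malware flagged as malware, fn: malware missed,
# fp: goodware flagged as malware, tn: goodware passed as goodware.
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall on the malware class is usually the figure to watch in this setting, because a missed malware sample (a false negative) tends to be more costly than a benign program flagged by mistake.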

6 Results

In this study, we developed and evaluated four machine learning models for malware
detection, all of which achieved accuracy above 99%. Random Forest achieved the highest
accuracy (99.99806%), followed closely by Logistic Regression (99.9806%). The Decision Tree
model came third (99.83514%), and the Support Vector Machine (SVM) model, while still
accurate, ranked lowest (99.71877%). Overall, the results show that these algorithms can
accurately separate malware from goodware on this dataset, with a spread of less than 0.3
percentage points between the best and worst models.

7 Conclusion

The primary objective of this project was to leverage machine learning techniques for the detection
of malware within systems. By developing and evaluating multiple models, we have demonstrated
that machine learning can distinguish malware from goodware based on behavioral logs, which can
strengthen cybersecurity measures at both individual and organizational levels. The high accuracy rates
achieved by the various models highlight the potential of machine learning to provide reliable and
efficient malware detection solutions. As cyber threats continue to evolve, integrating such
techniques into cybersecurity frameworks can help organizations proactively safeguard
their systems and data against malicious attacks. Ultimately, this project underscores the value
of machine learning as a tool in the ongoing effort against malware and cyber threats.

8 Limitations

The malware detection model developed in this project has several limitations. Firstly, it can only detect
malware that it has been specifically trained on, which may limit its effectiveness against new or
unknown threats. Additionally, while the model excels at identifying malware, it does not offer
protection against such threats or provide capabilities for malware removal. The training data used could
have encompassed a broader range of malware types to enhance its robustness. Furthermore, this model
is primarily designed for Windows operating systems, which restricts its applicability to other
environments, such as Linux or macOS.

9 Future Enhancements

To address these limitations and improve the overall effectiveness of the malware detection system,
several future enhancements can be considered:

• Training on a Wider Variety of Malware: Expanding the training dataset to include more
diverse types of malware will help enhance detection capabilities and generalization to new
threats.
• Integration for Live Operation: Developing a system that integrates the model to work in
real-time can enable proactive malware detection and immediate response to threats.
• Combining Multiple Models: Implementing ensemble techniques by combining numerous
models may improve accuracy and robustness, leveraging the strengths of each algorithm.
• Developing Cross-Platform Models: Building similar detection models tailored for
different operating systems, such as Linux and macOS, will broaden the applicability of the
solution.
• Creating an Unsupervised Version: Exploring the development of an unsupervised learning
approach for malware detection could allow the model to identify novel threats without
relying on labeled training data.

References

1) Baker, K. (2022, January 4). Malware analysis explained. CrowdStrike.
https://www.crowdstrike.com/cybersecurity-101/malware/malware-analysis/
2) HackerSploit. (2019, August 10). Malware analysis bootcamp - Setting up our environment
[Video]. YouTube. https://www.youtube.com/watch?v=F1LE56QQ7iA
3) MalwareBazaar. (n.d.). Malware samples. https://bazaar.abuse.ch/
4) Mandiant. (2021, October 23). Flare-VM version 3.0 [Software]. GitHub.
https://github.com/mandiant/flare-vm
5) Ricardo, C. (2019, August 5). Machine learning for cyber security: Lectures [Video
playlist]. YouTube.
https://www.youtube.com/watch?v=JxcBm7CRtI0&list=PL74sw1ohGx7GHqDHCkXZeqMQBVUTMrVLE
6) Ricardo, C. (2019, June 25). Machine learning for cyber security: Labs [Video playlist].
YouTube.
https://www.youtube.com/watch?v=lTge-G02Cis&list=PL74sw1ohGx7FE-DI18bOfi2X61zRE-wMd
7) Vectra AI. (2018, June 15). Machine learning fundamentals for cyber security pros
[Video]. YouTube. https://www.youtube.com/watch?v=uPSgfNhd2qY
8) Virus-Samples. (2021, February 6). Malware-sample-sources [Malware samples].
GitHub. https://github.com/Virus-Samples/Malware-Sample-Sources
9) McAfee. (n.d.). What is malware?
https://www.mcafee.com/en-in/antivirus/malware.html
10) Cisco. (n.d.). What is malware?
https://www.cisco.com/c/en_in/products/security/advanced-malware-protection/what-is-malware.html