Malwaredetection 07
MACHINE LEARNING
A Minor Project Report Submitted in Partial Fulfilment of the Requirements for the Award of
Degree Of
BACHELOR OF TECHNOLOGY
IN
BY
A. Chandana (21C31A5604)
N. Kavya (21C31A5636)
M.Bhanu prakash (21C31A5628)
V.Harsha Vardhan chary(21C31A5657)
S. SRAVANTHI
Asso. Prof Dept of CE(SE)
(SOFTWARE ENGINEERING)
CERTIFICATE
This is to certify that A. Chandana (21C31A5604), along with N. Kavya (21C31A5636),
M. Bhanu Prakash (21C31A5628), and V. Harsha Vardhan Chary (21C31A5657) of B.Tech (CSW
IV/I), have satisfactorily completed the Major project work entitled “PROJECT NAME” in
partial fulfilment of the requirements of the B.Tech degree during the academic year 2024-2025.
External Examiner
BALAJI INSTITUTE OF TECHNOLOGY & SCIENCE
Laknepally, Narsampet, Warangal (Rural)-506331, Telangana State, India
(Autonomous)
Accredited by NBA (UG-CE, EEE, ECE, ME & CSE Programmes) & NAAC A+ Grade
(Affiliated to JNTU Hyderabad and Approved by the AICTE, New Delhi)
(SOFTWARE ENGINEERING)
The results of the investigation enclosed in this report have been verified and found
satisfactory. The results embodied in this thesis have not been submitted to any other University
for the award of a degree or diploma.
I am also thankful to Mr. S. Santhosh Kumar, Asst. Prof. and Project Coordinator, for
providing excellent facilities, motivation and valuable guidance throughout the project
work. With his co-operation and encouragement, I completed the project work in time.
I take this opportunity to express my deep and sincere gratitude to the project guide,
S. Sravanthi, Balaji Institute of Technology & Science.
Last but not least, I would like to express my deep sense of gratitude and earnest thanks
to my dear parents for their moral support and heartfelt co-operation in doing the project.
I would also like to thank all the teaching and non-teaching staff and my friends, whose
direct or indirect help has enabled us to complete this work successfully.
A.Chandana (21C31A5604)
N.Kavya (21C31A5636)
M.Bhanu prakash (21C31A5628)
V.Harsha Vardhan chary (21C31A5657)
ABSTRACT
Malware is malicious code that remains undetected by the user and enables attacks that cause
substantial harm to electronic devices. Such software can run silently, damage the computer, and
keeps increasing in number over time, posing a growing danger to Internet security. There will
always be a ceaseless war between digital security professionals and malware developers, because
the development of malicious software co-exists with advances in general computer technologies.
Today, much research focuses on the development and application of machine learning techniques
for malware detection and classification. Machine learning can become a game changer for cyber
security and malware detection. In this project, different malware analysis and classification
methods are studied and compared to find the accuracy of various machine learning algorithms such
as decision trees, random forest, gradient boosting, logistic regression, CNN, DNN, LSTM, SVM,
and Naïve Bayes. A new system is also proposed based on both static and dynamic techniques along
with different classification techniques. The rapid growth of malware threats has necessitated
the development of robust detection systems to protect computer systems. The model uses a
combination of static and dynamic features to detect malware with high accuracy and low false
positives. Additionally, the system employs behavioural analysis to detect anomalies indicative
of malware activity. The system is evaluated on a dataset of real-world malware samples and
demonstrates superior detection performance compared to existing solutions. It offers an
effective solution for real-time malware detection and can be integrated into existing security
frameworks to enhance overall system security.
TABLE OF CONTENTS
Figures
1. Malware Identifier
2. Class Diagram
3. Use Case
4. Activity Diagram
5. State Diagram
6. Sequence Diagram
7. Deployment Diagram
8. Final Output Screen-1
9. Final Output Screen-2
1. INTRODUCTION
We are building a system to detect malicious software. It uses machine learning algorithms to
find malware quickly and accurately, and it can even find new malware that has not been seen
before. It is designed to grow with your network, is easy to use and understand, and works with
other security tools. It learns and improves over time. It finds malware and stops it from
causing harm, helping to keep your network and computer systems safe from attacks.
The goal of this project is to design and develop a machine learning-based system for
detecting and classifying malware in real-time. The system will utilize advanced algorithms
and techniques to identify zero-day malware and evolving threats, providing accurate and
reliable detection and prevention.
improve the design process of such detectors, since they reveal characterizing patterns, thus
guiding the human expert towards the understanding of the most relevant features.
While classic malware has focused on desktop systems and the Windows platform,
recent attacks have started to target smartphones and mobile platforms, such as Android. In
this chapter, we investigate a recent threat of this development, namely Android ransomware.
The detection of such a threat represents a challenging, yet illustrative domain for assessing
the impact of explainability. Ransomware acts by locking the compromised device or
encrypting its data, then forcing the device owner to pay a ransom in order to restore the
device functionality. Scales et al. have shown that ransomware developers typically build
such dangerous apps so that normally-legitimate components and functionalities (e.g.,
encryption) perform malicious behaviour; thus, making them harder to be distinguished from
genuine applications. Given this context, and according to previous works (Maiorca et al.,
2017; Scales et al., 2019, 2021), we investigate if and to what extent state-of-the-art
explainability techniques help to identify the features that characterize ransomware apps, i.e.,
the properties that are required to be present in order to combat ransomware offensives
effectively. Our contribution is threefold:
1. Leveraging the approach of our previous work, we propose practical strategies for
identifying the specific samples and ransomware algorithms.
2. We countercheck the effectiveness of our analysis by evaluating the prediction performance
of classifiers trained with the discovered relevant features.
We believe that our proposal can help cyber threat intelligence teams in the early
detection of new ransomware families, and, above all, could be a starting point to help design
other malware detection systems through the identification of their distinctive features. We
first introduce background notions about Android, ransomware attacks, and their detection
followed by a brief illustration of explanation methods. Then, our approach is presented in
Section. Since the explanation methods we consider have been originally designed to indicate
the most influential features for a single prediction, we propose to evaluate the distribution
of explanations rather than individual instances. This statistical view enables us to uncover
characteristics of malware shared across variants of the same family. In our experiments, we
analyze the output of explanation methods to extract information about the set of features that
mostly characterize ransomware samples.
Key Features:
1. Machine learning-based detection engine
2. Advanced threat intelligence and analytics
3. Real-time monitoring and alerting
4. Scalable and flexible architecture
5. Continuous updates and improvement
2. LITERATURE SURVEY
A good deal of research has been carried out on the subject of malware detection.
In one study, various machine learning algorithms, including decision trees and random
forest, are used for malware detection. The algorithm with the highest accuracy is selected,
which provides a high detection ratio for the system. The performance of the system is also
assessed by calculating the false positive and false negative rates using the confusion matrix.
The aim is to find the files that contain malware. In another study, a novel deep-learning
based architecture is proposed which classifies malware variants based on a hybrid model of
classification. The goal is to provide a new hybrid architecture that integrates two pre-trained
network models in an optimized manner. This architecture consists of four main steps,
namely: the acquisition of data, the conception of a deep neural network architecture, and the
training of the proposed deep neural network. With many computer users, corporations, and
governments affected by the rampant increase in malware attacks, malware detection
continues to be a hot research topic. Current malware detection solutions that perform static
and dynamic analysis of malware signatures and behavioural patterns are time consuming
and have proven ineffective at identifying unknown malware in real time.
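The false positive and false negative rates mentioned above can be read off a confusion matrix. The following is a minimal sketch in plain Python; the function names and the label lists are hypothetical and serve only to illustrate the arithmetic, not to reproduce results from the surveyed work.

```python
def confusion_counts(y_true, y_pred, positive="malware"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def error_rates(tp, fp, tn, fn):
    """Return accuracy, false positive rate and false negative rate.

    Assumes at least one benign and one malware sample are present,
    otherwise the denominators would be zero.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Hypothetical ground truth and predictions, for illustration only.
y_true = ["malware"] * 4 + ["benign"] * 6
y_pred = ["malware", "malware", "malware", "benign",
          "benign", "benign", "benign", "benign", "benign", "malware"]

counts = confusion_counts(y_true, y_pred)
rates = error_rates(*counts)
print(counts)
print(rates)
```

A low false positive rate matters because every benign file flagged as malware costs analyst time, while a low false negative rate matters because every missed sample is an undetected infection.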
uncover the problems and flaws that motivated us to propose solutions and work on this
project.
A literature review presents the current knowledge, including substantive findings as well as
theoretical and methodological contributions to a particular topic. Literature reviews use
secondary sources and do not report new or original experimental work. Most often associated
with academic-oriented literature, such as a thesis, dissertation or a peer-reviewed journal
article, a literature review usually precedes the methodology and results sections, although
this is not always the case. Literature reviews are also common in a research proposal or
prospectus. Its main goals are to situate the current study within the body of literature and
to provide context for the particular reader. Literature reviews are a basis for research in
nearly every academic field. A literature survey includes the following:
• Existing theories about the topic which are accepted universally.
• Books written on the topic, both generic and specific.
• Research done in the field usually in the order of oldest to latest.
• Challenges being faced and on-going work, if available.
A literature survey describes the existing work on the given project. It deals with the
problems associated with the existing system and gives the reader clear knowledge of how to
deal with and provide solutions to those problems. Concentrate on your own field of
expertise: even if another field uses the same words, they usually mean something completely
different.
• It improves the quality of the literature survey to exclude sidetracks. Remember to explicate
what is excluded.
Before building our application, the following system was taken into consideration: Malware
Analysis and Detection Using Machine Learning Algorithms, by Muhammad Shoaib Akhtar and
Tao Feng. Malware is a major threat to the security of computer systems and networks.
Traditional signature-based malware detection methods are becoming increasingly ineffective
against new and emerging malware strains. Machine learning (ML) algorithms have the potential
to overcome these limitations by detecting malware based on its behaviour and other
characteristics.
Signature-based detection : Relies on a database of known malware signatures.
Behavioural detection : Monitors system behaviour to identify suspicious activity.
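To make the signature-based approach concrete, the sketch below checks a file's SHA-256 digest against a set of known-bad digests. It is only an illustration under stated assumptions: the signature set and the payload bytes are hypothetical, and real products use far richer signature formats than a single hash.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def matches_signature(data: bytes, signatures: set) -> bool:
    """Signature-based check: is this exact content in the known-bad set?"""
    return sha256_of(data) in signatures


# Hypothetical signature database built from one made-up payload.
known_payload = b"hypothetical malicious payload"
signature_db = {sha256_of(known_payload)}

# The exact payload is caught, but a one-byte variant is not.
print(matches_signature(known_payload, signature_db))
print(matches_signature(known_payload + b"!", signature_db))
```

The second call shows why signature databases struggle with new or slightly modified strains, which is the limitation that behavioural and machine learning approaches try to address.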
Machine Learning (ML) Approaches:
This survey highlights the evolution of malware detection systems, from traditional methods
to ML and DL approaches, and the emerging trends and challenges in this field.
3. SYSTEM REQUIREMENTS
HARDWARE REQUIREMENTS:
• Processor: Intel Pentium Core or Above
The processor is the central component of the computer that handles all the logical
instructions and processes. A system with an Intel Pentium Core or a higher version ensures
basic functionality and the ability to process data efficiently.
• RAM: 6 GB
Random Access Memory (RAM) temporarily stores data that the system is actively using.
Having 6 GB of RAM allows the system to run applications smoothly and handle
multitasking effectively. Since RAM is volatile, data is lost when the system is powered off.
• Hard Disk: 64 GB
The hard disk is a non-volatile storage device, meaning it retains data even when the system
is powered off. With 64 GB of storage, the system can store necessary software, files, and
data required for operation.
SOFTWARE REQUIREMENTS:
• Operating Systems: Windows 7 and above or Ubuntu v12.04 and above.
These operating systems provide the environment for running the malware detection system.
Windows is widely used for its user-friendly interface, while Ubuntu is a Linux-based OS
preferred for its stability and open-source features.
• Front End: Python, HTML, CSS, JavaScript
The front-end languages are used to design and develop the user interface. Python supports
the integration of back-end logic. HTML, CSS, and JavaScript build user-friendly, interactive
web pages.
• Data: CSV File
The system processes input data in CSV (Comma-Separated Values) format, a lightweight
and widely-used file type for data storage and transfer.
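As a rough illustration of how CSV input might be turned into feature rows and labels, the sketch below uses only the Python standard library. The column names (`Name`, `Machine`, `Subsystem`, `DllCharacteristics`, `legitimate`) and the two rows are assumptions made for the example; the report does not specify the dataset schema.

```python
import csv
import io

# Hypothetical CSV content; the real dataset's columns may differ.
SAMPLE_CSV = """Name,Machine,Subsystem,DllCharacteristics,legitimate
app_a.exe,332,2,33088,1
app_b.exe,34404,3,0,0
"""


def load_rows(text):
    """Parse CSV text into a list of numeric feature dicts and a label list."""
    reader = csv.DictReader(io.StringIO(text))
    features, labels = [], []
    for row in reader:
        labels.append(int(row.pop("legitimate")))  # assumed label column
        row.pop("Name")                            # file name is not a feature
        features.append({key: int(value) for key, value in row.items()})
    return features, labels


features, labels = load_rows(SAMPLE_CSV)
print(features)
print(labels)
```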
NumPy: Used for numerical computations and handling large datasets.
Seaborn: A library for data visualization to analyze trends and patterns.
These tools and libraries enable the system to implement machine learning
algorithms. They allow the system to improve its detection accuracy by learning from patterns
in data without being explicitly programmed for each scenario.
4. FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal
is put forth with a very general plan for the project and some cost estimates. During system
analysis, the feasibility study of the proposed system is carried out to ensure that
the proposed system is not a burden to the company. For feasibility analysis, some
understanding of the major requirements for the system is essential.
The goal of this feasibility study is to determine the practicality of implementing a malware
detection system using machine learning algorithms. The study concludes that a machine
learning-based approach is feasible and can provide accurate and efficient malware detection.
Some considerations involved in the feasibility analysis are:
However, there are also costs associated with implementing and maintaining a malware
detection system, including the initial investment in software, hardware, and training, as well
as ongoing maintenance, updates, and subscription fees. Furthermore, the system may
generate false alarms or miss certain threats, leading to wasted resources and potential
security gaps. To determine the economic feasibility of a malware detection system,
organizations should conduct a thorough analysis of the costs and benefits, including
calculating the total cost of ownership, return on investment, and break-even point. This will
help them decide whether the benefits of the system outweigh the costs and whether it is a
worthwhile investment for their organization.
acquiring the malware detection software. Commercial solutions, such as those from
CrowdStrike or McAfee, often involve licensing fees based on the number of devices or users,
which can add up quickly depending on the scale of the organization. Open-source
alternatives may reduce licensing costs but can introduce additional expenses in terms of
setup, configuration, and ongoing maintenance, which may require specialized expertise.
Furthermore, if the system is deployed on-premises, there may be additional costs for
necessary hardware upgrades or the purchase of dedicated servers, particularly if the system
requires significant processing power for real-time monitoring or advanced threat detection
features like machine learning.
Ongoing costs are another critical consideration. These may include software updates
to ensure the system remains effective against evolving threats, as well as the potential costs
of system administration. Maintaining a malware detection system often requires dedicated
personnel to monitor alerts, respond to potential incidents, and ensure the system is running
optimally. If the solution is cloud-based, there may be subscription fees based on usage, such
as the number of devices or the volume of data being processed. Operational costs also
include data storage for logs and alerts, which could grow significantly over time, depending
on the size of the organization and the frequency of malware threats.
This includes evaluating hardware and software capabilities, network and infrastructure
compatibility, data storage and management, security and access controls, integration with
existing systems, and technical expertise and resource availability. By considering these
factors, you can determine whether a project or system is technically feasible and make
informed decisions about its implementation.
The technical feasibility of a malware detection system examines whether the existing
technological infrastructure of an organization can support the implementation of the
system, and whether the system's capabilities meet the technical requirements to effectively
detect, prevent, and respond to malware threats. This analysis involves evaluating factors
such as compatibility, scalability, system integration, performance, and the effectiveness of
detection methods.
First, the system must be compatible with the organization’s existing IT infrastructure,
including hardware, software, and network configurations. Malware detection solutions can
either be deployed on-premises or as cloud-based services, and the choice between the two
depends on the organization.
• The system's ability to integrate with existing workflows and processes, such as incident
response and change management.
• The availability of trained personnel to operate and maintain the system.
• The system's ability to adapt to changing operational requirements and evolving malware
threats.
• The system's compatibility with existing infrastructure, including hardware, software, and
network architectures.
• The ability to generate actionable alerts and reports that inform operational decisions.
• The system's ability to scale to meet growing operational demands.
• The ability to integrate with existing security tools and systems, such as firewalls and
intrusion detection systems.
By evaluating these factors, organizations can determine whether a malware detection system
is operationally feasible and can be successfully integrated into their daily operations.
The operational feasibility of a malware detection system focuses on how well the
system can be deployed, maintained, and used within an organization’s day-to-day
operations. It examines whether the necessary resources, skills, processes, and workflows are
in place to ensure the system can operate effectively and deliver value over time. This aspect
of feasibility considers both the human and technical factors involved in managing and
utilizing the system. From an operational perspective, one of the first factors to consider is the
availability of skilled personnel to manage and operate the malware detection system. A
successful implementation requires IT staff who are trained in cybersecurity practices,
malware detection, and incident response. Depending on the complexity of the system, the
organization may need specialized knowledge in areas such as network security, system
administration, or even machine learning if advanced detection methods like behavioral
analysis or AI are involved. If the organization lacks these skills in-house, it may need to
invest in training or hire additional personnel, which could affect the feasibility of the system,
especially in smaller organizations with limited resources.
The deployment and integration of the malware detection system with existing IT
infrastructure is another crucial operational consideration. The system needs to be easily
integrated into the organization’s network, endpoints, and other security systems (e.g.,
firewalls, SIEM, endpoint protection). This requires effective coordination between IT
departments, ensuring that the system does not interfere with other critical operations or
systems. Additionally, the deployment process must be seamless, minimizing downtime and
disruption to employees' work, especially in environments that rely on continuous system
availability.
The legal and ethical feasibility of a malware detection system refers to its compliance
with relevant laws, regulations, and ethical standards. This includes:
Legal requirements:
• Compliance with data protection and privacy laws (e.g., GDPR, HIPAA)
• Compliance with industry-specific regulations (e.g., PCI-DSS for payment card data)
Ethical considerations
By evaluating these factors, organizations can determine whether a malware detection system
is legally and ethically feasible, and ensure that its implementation and operation align with
relevant laws, regulations, and ethical standards.
4.5 Schedule Feasibility
An economic feasibility study of a malware detection system evaluates its financial viability
and potential return on investment. Here are some key points to consider:
Development Costs:
• Personnel salaries and benefits
• Software and hardware expenses
• Training and testing costs.
Operational Costs
• Maintenance and updates
• System administration and support
• Energy and infrastructure costs
Benefits
• Reduced malware-related downtime and losses
• Enhanced security and compliance
• Improved system performance and productivity
• Potential cost savings from reduced security breaches
Cost-Benefit Analysis
• Calculate the total cost of ownership (TCO)
• Estimate the return on investment (ROI)
• Compare the costs and benefits to determine feasibility
Break-Even Analysis:
• Calculate the point at which the system's benefits equal its costs.
• Determine the time required to reach the break-even point.
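The cost-benefit and break-even steps listed above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical and every monetary figure is a made-up placeholder, not an estimate for this project.

```python
def total_cost_of_ownership(initial_cost, annual_operating_cost, years):
    """TCO = one-off development cost plus operating cost over the period."""
    return initial_cost + annual_operating_cost * years


def return_on_investment(total_benefit, total_cost):
    """ROI as a fraction: net gain divided by total cost."""
    return (total_benefit - total_cost) / total_cost


def break_even_years(initial_cost, annual_benefit, annual_operating_cost):
    """Years until cumulative net benefit covers the initial cost.

    Returns None when the yearly benefit never exceeds the yearly cost.
    """
    net_per_year = annual_benefit - annual_operating_cost
    if net_per_year <= 0:
        return None
    return initial_cost / net_per_year


# Placeholder figures, for illustration only.
tco = total_cost_of_ownership(50_000, 10_000, years=3)
roi = return_on_investment(total_benefit=120_000, total_cost=tco)
years = break_even_years(50_000, annual_benefit=40_000, annual_operating_cost=10_000)
print(tco, roi, years)
```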
Key elements of technical feasibility include:
Technical feasibility assesses whether a malware detection system can be developed and
implemented using existing technology and resources. Here are some key points to consider:
Technical Requirements
Technical Resources
5. SYSTEM ANALYSIS
• Machine type
• DLL characteristics
• Subsystem
The system uses a Flask web application to provide a user interface for uploading executable
files and displaying the results.
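The header fields listed above (machine type, subsystem, DLL characteristics) live at fixed offsets in a PE file. The project's own code imports the `pefile` library for this; the sketch below instead reads the same three fields with the standard `struct` module so that it runs without third-party packages. The function name is hypothetical, and the offsets follow the published PE layout (the `e_lfanew` pointer at 0x3C, a 20-byte COFF header, then the optional header).

```python
import struct


def read_pe_header_fields(data: bytes) -> dict:
    """Extract Machine, Subsystem and DllCharacteristics from PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # e_lfanew: offset of the PE signature, stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # Machine is the first field of the COFF header, right after the signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    # The optional header starts after the 4-byte signature and 20-byte COFF header.
    optional_header = pe_offset + 24
    # Subsystem (offset 68) and DllCharacteristics (offset 70) sit at the
    # same offsets in both PE32 and PE32+ optional headers.
    subsystem, dll_characteristics = struct.unpack_from(
        "<HH", data, optional_header + 68
    )
    return {
        "Machine": machine,
        "Subsystem": subsystem,
        "DllCharacteristics": dll_characteristics,
    }
```

Given the raw bytes of an uploaded file, `read_pe_header_fields(file_bytes)` returns a small dictionary that can be shown on the results page or passed to a classifier as numeric features.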
Limitations:
The system has limitations, such as:
A high-performance malware detection system using deep learning and feature selection
methodologies is introduced. Two different malware datasets are used to detect malware and
differentiate it from benign activities. The datasets are preprocessed, and then correlation-
based feature selection is applied to produce different feature-selected datasets. The dense
and LSTM-based deep learning models are then trained using these different versions of
feature-selected datasets. Techniques Used.
• Accuracy is less than 90%. It is suitable only for detecting whether an attack is present or not, which is not suitable
5.2 PROPOSED SYSTEM
The proposed system, Malware Defender, is an enhanced malware detection and analysis
framework that builds upon the existing system.
The proposed system uses advanced machine learning and deep learning techniques to
The system extracts additional features from the executable file, including:
• Advanced entropy analysis
• Availability analysis
• API call analysis
• String analysis
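Two of the features listed above, entropy analysis and string analysis, can be sketched with the standard library alone. The code below computes the Shannon entropy of a byte sequence (values near 8 bits per byte often indicate packed or encrypted content) and extracts runs of printable ASCII characters. The function names and the minimum string length of 4 are assumptions for illustration, not the project's actual feature extractor.

```python
import math
import re
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def printable_strings(data: bytes, min_length: int = 4):
    """Return runs of printable ASCII characters of at least min_length."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [match.decode("ascii") for match in re.findall(pattern, data)]


# Toy byte sequences, for illustration only.
sample = b"\x00\x01kernel32.dll\x00ab\x00CreateFileA\xff"
print(shannon_entropy(b"\x00" * 64))
print(shannon_entropy(bytes(range(256))))
print(printable_strings(sample))
```

Extracted strings such as imported DLL or API names can then feed the API call analysis, while per-section entropy values become numeric features for the classifier.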
The system integrates with real-time threat intelligence feeds to stay up-to-date with
the latest malware threats. The approach used in this project aims to use a multi-classifier to
detect and classify malware. Malware classification is approached using two techniques:
binary and multi-class problems. The binary classification differentiates
between malicious and benign classes, whereas the multi-class classification sorts
the malicious samples into Virus, Trojan, Spyware, Worm, Ransomware, and Adware types.
A supervised learning approach is followed, and machine learning models such as Random Forest,
Decision Tree, Support Vector Machine, Naïve Bayes, and K-Nearest
Neighbour are used for the classification of malware.
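The difference between the binary and the multi-class formulation can be shown with the simplest model named above, K-Nearest Neighbour. The sketch below is a plain-Python toy: the two features (an entropy value and a count of suspicious API calls), the training samples, and the helper names are all hypothetical, and a real system would more likely rely on a library implementation trained on the full feature set.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training samples."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def to_binary(label):
    """Collapse every malware family into a single 'Malicious' class."""
    return "Benign" if label == "Benign" else "Malicious"


# Hypothetical samples: (entropy, suspicious API call count) -> family label.
multi_class_train = [
    ((7.8, 12), "Ransomware"), ((7.5, 10), "Ransomware"), ((7.9, 11), "Ransomware"),
    ((6.9, 4), "Trojan"), ((6.5, 5), "Trojan"), ((6.7, 3), "Trojan"),
    ((4.1, 0), "Benign"), ((3.8, 1), "Benign"), ((4.5, 0), "Benign"),
]
# The binary problem reuses the same features with collapsed labels.
binary_train = [(features, to_binary(label)) for features, label in multi_class_train]

query = (7.7, 11)
print(knn_predict(binary_train, query))       # binary: malicious or benign
print(knn_predict(multi_class_train, query))  # multi-class: which family
```

The same feature vectors serve both problems; only the label set changes, which is why a single pipeline can report both whether a file is malicious and which type it appears to be.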
The results show that Random Forest performs well in terms of Binary classification
and the multi-classification problem with an accuracy of 95% and 91% respectively.
Advantages: Less implementation time. Accuracy is above 90%. It can also be
used to detect the types of attacks.
File Details:
• Malware detection result
• Recommendations for remediation
The proposed system aims to improve upon the existing system by:
• Enhancing detection accuracy and reducing false positives.
• Providing real-time protection against zero-day attacks and unknown threats
• Automating malware analysis and remediation
• Improving scalability and efficiency
Algorithm used
Ransomware uses various algorithms to encrypt files, making them inaccessible to victims.
An asymmetric-key algorithm used for encrypting and decrypting data.
A cryptographic hash function used for data integrity.
Elliptic Curve Cryptography (ECC):
Ransomware algorithms
1. WannaCry: Used AES-128-CBC and RSA-2048.
2. NotPetya: Used AES-128-CBC and RSA-4096.
3. Locky: Used AES-128-CBC and RSA-2048.
Keep in mind that ransomware developers constantly evolve and update their encryption
methods, making it essential to stay informed about the latest threats.
6. SYSTEM DESIGN
6.1 UML DIAGRAMS
6.1.1 CLASS DIAGRAM
The UML class diagram illustrates the interaction and functionality of different components
within a malware simulation environment. At the center of the system is the MASim Agent,
which possesses a range of methods such as inform(), propagate(), simulate malware
behaviour(), create(), and connect(). This agent communicates with the MASim Binary,
which has its own methods like Run(), propagate(), and Create(). Together, these components
simulate the behaviour of malicious software.
6.1.2 USE CASE DIAGRAM
The diagram illustrates the workflow of a malware detection and classification system
through various actors and their corresponding actions. The process begins with the Data
Collector, who is responsible for gathering input data, specifically malware and benign files.
This marks the initial step in preparing the dataset for analysis and classification.
The workflow then moves to the PCA (Principal Component Analysis) actor, who
focuses on optimizing the dataset. The first task here is Dimension Reduction, which reduces
the complexity of the data by minimizing the number of features while retaining critical
information. Once the dimensionality is reduced, New Feature Generation takes place,
creating refined features that improve the effectiveness of the model.
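The dimension reduction and new feature generation steps described above can be sketched with NumPy, which the project already lists among its libraries. This is a minimal PCA via singular value decomposition; the function name and the synthetic data are assumptions for illustration, not the project's preprocessing code.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project samples onto their top principal components.

    Returns the reduced data and the fraction of variance explained
    by each kept component.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    # The projected coordinates are the newly generated features.
    return centered @ components.T, explained[:n_components]


# Synthetic data: the third feature is redundant (sum of the first two).
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1]])

reduced, explained_ratio = pca_reduce(X, n_components=2)
print(reduced.shape)
print(explained_ratio.sum())
```

Because the third column carries no independent information, two components retain essentially all of the variance, which is the sense in which PCA reduces features while retaining critical information.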
6.1.3 ACTIVITY DIAGRAM
The classifier comes into play to build and evaluate the machine learning model. The
Build Model step involves training the model on the prepared data. This is followed by the
Test Model phase, where the model’s performance is evaluated to ensure accuracy and
reliability. The last step is Classify benign or Malware, where the trained model determines
whether a given file is benign or malicious, completing the process.
6.1.4 STATE DIAGRAM
A state diagram is a visual representation that illustrates the various states an object can
occupy during its lifecycle and the transitions between those states. It helps in understanding
how an object behaves in response to different events.
In a state diagram, you start with an initial state, which is usually represented by a filled
circle. From there, the diagram shows different states, depicted as rounded rectangles. The
transitions between these states are indicated by arrows, which represent the movement from
one state to another triggered by specific events or conditions.
State diagrams are particularly useful in scenarios where the behaviour of a system is closely
tied to its current state, such as in user interfaces, protocol designs, or any system where
actions depend on the state of the object. They provide clarity on how different states interact
and can help identify potential issues in state transitions.
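As an illustration of the states and event-triggered transitions described above, the sketch below models the lifecycle of one uploaded file as a small state machine. The state names, event names, and transition table are hypothetical and are not taken from the project's state diagram.

```python
from enum import Enum, auto


class ScanState(Enum):
    UPLOADED = auto()      # initial state
    EXTRACTING = auto()
    CLASSIFYING = auto()
    REPORTED = auto()      # final state: result shown to the user
    FAILED = auto()        # final state: processing error


# Each (current state, event) pair maps to exactly one next state.
TRANSITIONS = {
    (ScanState.UPLOADED, "start"): ScanState.EXTRACTING,
    (ScanState.EXTRACTING, "features_ready"): ScanState.CLASSIFYING,
    (ScanState.EXTRACTING, "error"): ScanState.FAILED,
    (ScanState.CLASSIFYING, "prediction_ready"): ScanState.REPORTED,
    (ScanState.CLASSIFYING, "error"): ScanState.FAILED,
}


def next_state(state, event):
    """Follow one transition, rejecting events not allowed in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {state.name}"
        ) from None


state = ScanState.UPLOADED
for event in ["start", "features_ready", "prediction_ready"]:
    state = next_state(state, event)
print(state.name)
```

Writing the transitions as an explicit table makes invalid moves, such as reporting a result before classification, fail loudly instead of passing silently.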
6.1.5 SEQUENCE DIAGRAM
A sequence diagram is a type of interaction diagram that shows how objects interact in a
particular scenario of a use case. It focuses on the order of messages exchanged between the
objects over time, illustrating the sequence of events that occur during a specific process.
In a sequence diagram, you typically have vertical lines representing different objects or
participants involved in the interaction. The horizontal arrows between these lines represent
messages exchanged between the objects. The diagram is read from top to bottom, with the
time progressing as you move down the diagram.
6.1.6 DEPLOYMENT DIAGRAM
7. SOFTWARE ENVIRONMENT
Malware Detection Engine: At the core of the software environment is the malware
detection engine. This is the component responsible for identifying malicious software,
analyzing behaviour, and determining whether a file, application, or network activity is
suspicious or harmful. Depending on the type of detection system, the engine may employ
different methods, including:
• Dashboards: Visual representations of the malware detection system’s status, such as the
number of active threats, system health, and detection statistics.
• Alert Management: The ability to review, filter, and prioritize security alerts, along with
tools for investigating and responding to incidents.
• Incident Response: Features to isolate infected systems, block malicious files, or initiate
other remediation actions directly from the console.
• Reporting and Forensics: Detailed reports on detected threats, affected endpoints, and
analysis of attack vectors. This helps in auditing, compliance, and post-incident analysis.
8. IMPLEMENTATION
8.1 SAMPLE CODING:
App.py
import io
import os
import numpy as np
import pickle
import pefile
import math
import tempfile
from flask import Flask, request, render_template
from model import classify

# Run from this file's directory so relative paths (templates, model) resolve.
os.chdir(os.path.dirname(os.path.abspath(__file__)))
app = Flask(__name__)
app.static_folder = 'templates'


def display_dict(dictionary):
    """Format a dictionary as HTML lines of the form '<b>key:</b> value'."""
    details = []
    for key, value in dictionary.items():
        details.append(f"<b>{key}:</b> {value}")
    return "<br>".join(details)


@app.route('/')
def index():
    return render_template('index.html')


@app.route('/classify', methods=['POST'])
def classify_file():
    exe_file = request.files['exe_file']
    # Reject anything that is not a .exe before reading it.
    if not exe_file.filename.endswith('.exe'):
        return {
            'error': 'Wrong file uploaded. Please upload a .exe file.'
        }
    file_contents = exe_file.read()
    # classify() returns the prediction text and a dict of file details.
    result, details = classify(io.BytesIO(file_contents))
    return {
        'prediction': result,
        'details': details
    }


if __name__ == '__main__':
    app.run(debug=True)
INDEX.HTML
<!DOCTYPE html>
<html>
<head>
<title>Malware Defender</title>
<style>
    body {
        background-color: #F0F2F6;
        background-image: url('{{ url_for("static", filename="background.jpg") }}');
        /* Replace 'background.jpg' with your image file */
        background-repeat: repeat;
        background-size: cover;
    }
    .container {
        max-width: 500px;
        margin: 50px auto;
        padding: 20px;
        background-color: white;
        border-radius: 10px;
        box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
    }
    h1, h2, h3 {
        color: #333;
    }
    .btn {
        background-color: #4CAF50;
        color: white;
        padding: 10px 20px;
        border: none;
        border-radius: 4px;
        cursor: pointer;
    }
    .pattern {
        background-image: url('{{ url_for("static", filename="pattern.jpg") }}');
        /* Replace 'pattern.jpg' with your pattern image file */
        background-repeat: repeat;
    }
</style>
<script>
    // Validate the selected file and POST it to the /classify endpoint.
    function submitForm() {
        var fileInput = document.getElementById('exeFile');
        var file = fileInput.files[0];
        if (!file) {
            alert("Please upload a file");
            return;
        }
        if (!file.name.endsWith('.exe')) {
            alert("Wrong file uploaded. Please upload a .exe file.");
            return;
        }
        var formData = new FormData();
        formData.append('exe_file', file);
        var xhr = new XMLHttpRequest();
        xhr.onreadystatechange = function() {
            if (xhr.readyState === 4) {
                if (xhr.status === 200) {
                    var result = JSON.parse(xhr.responseText);
                    displayResult(result);
                } else {
                    alert("Error: " + xhr.statusText);
                }
            }
        };
        xhr.open('POST', '/classify', true);
        xhr.send(formData);
    }

    // Render the file details and the prediction returned by the server.
    function displayResult(result) {
        // Report server-side validation errors before rendering anything.
        if (result.error) {
            alert(result.error);
            return;
        }
        var container = document.getElementById('resultContainer');
        container.innerHTML = '';
        var heading = document.createElement('h2');
        heading.textContent = 'File Details:';
        container.appendChild(heading);
        var list = document.createElement('ul');
        for (var key in result.details) {
            var listItem = document.createElement('li');
            listItem.innerHTML = '<b>' + key + ': </b>' + result.details[key];
            list.appendChild(listItem);
        }
        container.appendChild(list);
        var resultHeading = document.createElement('h2');
        resultHeading.textContent = 'Malware Detection Result:';
        container.appendChild(resultHeading);
        var resultText = document.createElement('p');
        resultText.textContent = result.prediction;
        resultText.style.fontSize = '25px';
        resultText.style.fontWeight = 'bold';
        // Red for a malware verdict, green otherwise.
        if (result.prediction === 'File contains malware') {
            resultText.style.color = 'red';
        } else {
            resultText.style.color = 'green';
        }
        resultText.style.textAlign = 'center';
        container.appendChild(resultText);
    }
</script>
</head>
<body>
<div class="container pattern">
<h1>Malware Defender</h1>
<h3>What we do?</h3
<p>We will scan the .exe files and determine whether the file has malware or not.</p>
30
<input type="file" id="exeFile" accept=".exe"><br><br
<button class="btn" onclick="submitForm()">Scan</butto<div id="resultContainer"></div>
</div>
</body>
</html>
model.py
import numpy as np
import pickle
import pefile
import math
import tempfile
import os


def load_model():
    with open('randomModel.pkl', 'rb') as file:
        model = pickle.load(file)
    return model


def byte_entropy(data):
    """Shannon entropy of a byte string, in bits per byte."""
    size = len(data)
    if size == 0:
        return 0.0
    return -sum((data.count(c) / size) * math.log2(data.count(c) / size)
                for c in set(data))


def classify(exe_path):
    print("Classify function started")
    model = load_model()
    print("Model loaded successfully")

    # pefile is given a path here, so the upload is written to a temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(exe_path.read())
        temp_file_path = temp_file.name
    print("Temporary file created:", temp_file_path)

    pe = None
    try:
        pe = pefile.PE(temp_file_path)
        print("PE file loaded successfully")

        section_entropies = []
        for section in pe.sections:
            section_data = section.get_data()
            if len(section_data) > 0:
                section_entropies.append(byte_entropy(section_data))
        if not section_entropies:
            section_entropies = [0.0]  # avoid dividing by zero below
        print("Section entropies calculated:", section_entropies)

        features = {
            'Machine': pe.FILE_HEADER.Machine,
            'SizeOfOptionalHeader': pe.FILE_HEADER.SizeOfOptionalHeader,
            'MajorSubsystemVersion': pe.OPTIONAL_HEADER.MajorSubsystemVersion,
            'DllCharacteristics': pe.OPTIONAL_HEADER.DllCharacteristics,
            'SizeOfStackReserve': pe.OPTIONAL_HEADER.SizeOfStackReserve,
            'SectionsMeanEntropy': sum(section_entropies) / len(section_entropies),
            'SectionsMaxEntropy': max(section_entropies),
            'Subsystem': pe.OPTIONAL_HEADER.Subsystem,
            # Defaults, overwritten below when the file has a resource directory.
            'ResourcesMaxEntropy': 6,
            'VersionInformationSize': 1,
        }
        print("Features extracted:", features)

        resource_directory = pe.OPTIONAL_HEADER.DATA_DIRECTORY[
            pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_RESOURCE']]
        if resource_directory.VirtualAddress != 0:
            resource_section = pe.get_section_by_rva(resource_directory.VirtualAddress)
            if resource_section is not None:
                resource_data = resource_section.get_data()
                features['ResourcesMaxEntropy'] = byte_entropy(resource_data)
            # Version resources are identified by the RT_VERSION type id.
            # Assumption: the size of the version resource data stands in for the
            # 'Length' field read by the original listing.
            if hasattr(pe, 'DIRECTORY_ENTRY_RESOURCE'):
                for resource_type in pe.DIRECTORY_ENTRY_RESOURCE.entries:
                    if resource_type.id == pefile.RESOURCE_TYPE['RT_VERSION']:
                        for resource_id in resource_type.directory.entries:
                            version_info = resource_id.directory.entries[0].data.struct
                            features['VersionInformationSize'] = version_info.Size
        print("Features updated:", features)

        # The model input follows the insertion order of the features dict.
        lst = list(features.values())
        print("List created:", lst)

        pred = model.predict([lst])
        print("Prediction made:", pred)
    finally:
        # Release pefile's handle before deleting the temporary file.
        if pe is not None:
            pe.close()
        os.unlink(temp_file_path)
        print("Temporary file deleted")

    if pred[0] == 0:
        return "File is safe", features
    else:
        return "File contains malware", features
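The classify function above builds the model input from the order in which the feature dictionary happens to be filled. A minimal sketch of one way to make that order explicit is given below. The names FEATURE_ORDER and to_feature_vector are hypothetical, and the column order is an assumption copied from the listing; the report does not state the order used when the model was trained.

```python
# Assumed column order, copied from the feature dictionary in model.py.
FEATURE_ORDER = [
    "Machine",
    "SizeOfOptionalHeader",
    "MajorSubsystemVersion",
    "DllCharacteristics",
    "SizeOfStackReserve",
    "SectionsMeanEntropy",
    "SectionsMaxEntropy",
    "Subsystem",
    "ResourcesMaxEntropy",
    "VersionInformationSize",
]


def to_feature_vector(features):
    """Return the feature values as a list in FEATURE_ORDER.

    Raises KeyError if any expected feature is missing, so a silent
    column shift cannot reach the model.
    """
    missing = [name for name in FEATURE_ORDER if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [features[name] for name in FEATURE_ORDER]
```

With such a helper, the list passed to model.predict would not change if the dictionary were later built in a different order.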
main.py
import array
import math
import os
import pickle
import joblib
import pefile


def get_entropy(data):
    """Shannon entropy of the data, in bits per byte."""
    if len(data) == 0:
        return 0.0
    occurrences = array.array('L', [0] * 256)
    for x in data:
        occurrences[x if isinstance(x, int) else ord(x)] += 1
    entropy = 0
    for x in occurrences:
        if x:
            p_x = float(x) / len(data)
            entropy -= p_x * math.log(p_x, 2)
    return entropy


def get_resources(pe):
    """Return a list of [entropy, size] pairs, one per resource."""
    resources = []
    if hasattr(pe, 'DIRECTORY_ENTRY_RESOURCE'):
        try:
            for resource_type in pe.DIRECTORY_ENTRY_RESOURCE.entries:
                if hasattr(resource_type, 'directory'):
                    for resource_id in resource_type.directory.entries:
                        if hasattr(resource_id, 'directory'):
                            for resource_lang in resource_id.directory.entries:
                                data = pe.get_data(resource_lang.data.struct.OffsetToData,
                                                   resource_lang.data.struct.Size)
                                size = resource_lang.data.struct.Size
                                entropy = get_entropy(data)
                                resources.append([entropy, size])
        except Exception as e:
            # Malformed resource tables: keep whatever was collected so far.
            return resources
    return resources


def get_version_info(pe):
    """Return version info's"""
    # Written against the older pefile API; recent releases may nest FileInfo
    # one level deeper and expose Key as bytes.
    res = {}
    for fileinfo in pe.FileInfo:
        if fileinfo.Key == 'StringFileInfo':
            for st in fileinfo.StringTable:
                for entry in st.entries.items():
                    res[entry[0]] = entry[1]
        if fileinfo.Key == 'VarFileInfo':
            for var in fileinfo.Var:
                # dict.items() is not subscriptable in Python 3.
                key, value = list(var.entry.items())[0]
                res[key] = value
    if hasattr(pe, 'VS_FIXEDFILEINFO'):
        res['flags'] = pe.VS_FIXEDFILEINFO.FileFlags
        res['os'] = pe.VS_FIXEDFILEINFO.FileOS
        res['type'] = pe.VS_FIXEDFILEINFO.FileType
        res['file_version'] = pe.VS_FIXEDFILEINFO.FileVersionLS
        res['product_version'] = pe.VS_FIXEDFILEINFO.ProductVersionLS
        res['signature'] = pe.VS_FIXEDFILEINFO.Signature
        res['struct_version'] = pe.VS_FIXEDFILEINFO.StrucVersion
    return res


def extract_info(fpath):
    res = {}
    try:
        pe = pefile.PE(fpath)
    except pefile.PEFormatError:
        return {}
    res['Machine'] = pe.FILE_HEADER.Machine
    res['SizeOfOptionalHeader'] = pe.FILE_HEADER.SizeOfOptionalHeader
    res['Characteristics'] = pe.FILE_HEADER.Characteristics
    res['MajorLinkerVersion'] = pe.OPTIONAL_HEADER.MajorLinkerVersion
    res['MinorLinkerVersion'] = pe.OPTIONAL_HEADER.MinorLinkerVersion
    res['SizeOfCode'] = pe.OPTIONAL_HEADER.SizeOfCode
    res['SizeOfInitializedData'] = pe.OPTIONAL_HEADER.SizeOfInitializedData
    res['SizeOfUninitializedData'] = pe.OPTIONAL_HEADER.SizeOfUninitializedData
    res['AddressOfEntryPoint'] = pe.OPTIONAL_HEADER.AddressOfEntryPoint
    res['BaseOfCode'] = pe.OPTIONAL_HEADER.BaseOfCode
    try:
        res['BaseOfData'] = pe.OPTIONAL_HEADER.BaseOfData
    except AttributeError:
        # 64-bit (PE32+) optional headers have no BaseOfData field.
        res['BaseOfData'] = 0
    res['ImageBase'] = pe.OPTIONAL_HEADER.ImageBase
    res['SectionAlignment'] = pe.OPTIONAL_HEADER.SectionAlignment
    res['FileAlignment'] = pe.OPTIONAL_HEADER.FileAlignment
    res['MajorOperatingSystemVersion'] = pe.OPTIONAL_HEADER.MajorOperatingSystemVersion
    res['MinorOperatingSystemVersion'] = pe.OPTIONAL_HEADER.MinorOperatingSystemVersion
    res['MajorImageVersion'] = pe.OPTIONAL_HEADER.MajorImageVersion
    res['MinorImageVersion'] = pe.OPTIONAL_HEADER.MinorImageVersion
    res['MajorSubsystemVersion'] = pe.OPTIONAL_HEADER.MajorSubsystemVersion
    res['MinorSubsystemVersion'] = pe.OPTIONAL_HEADER.MinorSubsystemVersion
    res…
9. SYSTEM TESTING
System testing is a crucial phase in the development of a malware detection system using
machine learning. It validates that the system functions as expected under various conditions
and can effectively detect, prevent, and respond to malicious activity. The goal of testing is
to verify that the system is not only accurate in detecting known and unknown threats but also
performs well in terms of reliability, scalability, and integration with existing
infrastructure. The key types of system testing that should be conducted for a malware
detection system are described below.
Test Objectives:
1. Evaluate the accuracy and effectiveness of the malware detection system.
2. Identify and fix bugs, errors, and vulnerabilities.
3. Ensure the system meets the required specifications and performance criteria.
4. Validate the system's ability to detect various types of malware.
TYPES OF TESTING:
1. Functional Testing:
Functional testing ensures that all core features of the malware detection system are working
as intended. For a malware detection system, this typically involves testing the following:
• Malware Detection Accuracy:
The system should correctly identify both known malware (via signature-based detection)
and unknown malware (using heuristics, behavior analysis, or machine learning models). The
system should be tested with a variety of malware samples to verify that it detects both
common and advanced threats.
• False Positive and False Negative Rates:
The system should be evaluated for false positives (legitimate files identified as malicious)
and false negatives (malicious files that go undetected). A high false positive rate can lead to
alert fatigue and unnecessary system interventions, while a high false negative rate can allow
malware to slip through undetected.
• Quarantine and Remediation:
The system’s ability to quarantine infected files, block malicious activity, and provide
remediation (such as deleting or repairing infected files) should be tested. Testing should
also verify that the system allows manual intervention when needed.
• Signature Updates:
The ability of the system to properly handle updates to malware signatures, ensuring that it
is always equipped with the latest threat intelligence, is critical for effective detection.
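As a minimal sketch of how the false positive and false negative rates described above could be computed from a labelled test run: the function name error_rates is illustrative and not part of the project code, and the label convention (0 = safe, 1 = malware) is an assumption based on how classify() interprets the model output. The labels in the usage lines are made up for illustration and are not results from this project.

```python
def error_rates(y_true, y_pred):
    """Return (false_positive_rate, false_negative_rate).

    Assumed label convention: 0 = safe, 1 = malware.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0:
            fp += 1  # benign file flagged as malware
        elif pred == 0:
            fn += 1  # malware that went undetected
        else:
            tp += 1
    # Guard against empty classes so the rates are always defined.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Hypothetical labels for six test files.
truth = [0, 0, 0, 1, 1, 1]
preds = [0, 0, 1, 1, 1, 0]
fpr, fnr = error_rates(truth, preds)
print(f"FPR={fpr:.2f} FNR={fnr:.2f}")
```

Reporting the two rates separately, rather than a single accuracy figure, keeps alert fatigue and missed detections visible as distinct problems.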
2. Performance Testing:
Performance testing evaluates how well the malware detection system handles load and
functions under various operational conditions. This includes:
• System Resource Usage: Testing the system’s impact on system performance, such as
CPU, memory, and disk usage. The malware detection system should not cause significant
slowdowns, especially in environments where real-time detection is crucial. It is essential to
ensure the system’s resource consumption is within acceptable limits while still providing
accurate and timely threat detection.
• Scanning Speed: The system’s ability to scan files, processes, and network traffic in real
time or during scheduled scans should be tested for efficiency. Long scan times can be
disruptive to users, particularly in large enterprise environments.
• Scalability: Testing how well the system scales when applied to larger environments, such
as networks with thousands of endpoints. The system should be able to handle an increased
volume of devices, traffic, and data without degradation in performance.
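Scanning speed can be measured with a small timing harness. The sketch below is one possible approach, not the project's actual test code: time_scans is a hypothetical helper, and the scan argument stands in for any callable that classifies one sample, such as a wrapper around classify().

```python
import statistics
import time


def time_scans(scan, samples):
    """Run scan(sample) for each sample and summarise the wall-clock latency."""
    if not samples:
        raise ValueError("at least one sample is required")
    durations = []
    for sample in samples:
        start = time.perf_counter()
        scan(sample)
        durations.append(time.perf_counter() - start)
    return {
        "count": len(durations),
        "mean_s": statistics.mean(durations),
        "max_s": max(durations),
    }


# Stand-in scanner for illustration; a real run would pass the classifier.
summary = time_scans(lambda payload: len(payload), [b"a" * 10, b"b" * 1000])
print(summary["count"], "samples timed")
```

The maximum is reported alongside the mean because a few very slow scans can be more disruptive to users than a slightly higher average.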
3. Security Testing:
Security testing ensures that the malware detection system itself is secure and protected from
exploitation by attackers. Key areas to test include:
• Access Control: Verifying that only authorized personnel can access or modify the system’s
configuration and that user permissions are properly enforced. A compromise of the malware
detection system could allow attackers to disable or tamper with it, so proper security controls
must be in place.
• Data Integrity and Encryption: Testing how the system handles sensitive data, including
ensuring that logs, alert data, and threat intelligence are encrypted during transmission and
storage to prevent unauthorized access.
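In addition to the areas listed above, security testing can exercise how the upload endpoint handles untrusted input. The upload handler in Section 8 checks only the file name, so a test might also confirm that the content has the structure of a PE file. The sketch below is an illustration under that assumption: looks_like_pe is a hypothetical helper, not part of the project code. It relies on two facts of the PE format: a file starts with the bytes "MZ", and the 4-byte little-endian value at offset 0x3C points to the "PE\0\0" signature.

```python
import struct


def looks_like_pe(data: bytes) -> bool:
    """Return True if data starts with a DOS header that points to a PE signature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: offset of the PE signature, stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


# Build a minimal synthetic header for illustration.
header = bytearray(0x80)
header[:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x40)
header[0x40:0x44] = b"PE\x00\x00"
print(looks_like_pe(bytes(header)), looks_like_pe(b"plain text renamed to .exe"))
```

A check of this kind only screens out obviously wrong uploads; full parsing is still left to pefile.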
4. Integration Testing:
Integration testing focuses on ensuring that the malware detection system works seamlessly
with other components of the organization's IT infrastructure. This can include:
• Compatibility with Operating Systems and Devices: Verifying that the system is
compatible with various operating systems (Windows, macOS, Linux, etc.), versions, and
devices (laptops, desktops, mobile devices, etc.). Compatibility issues can lead to missed
detections or system failures.
• Integration with Other Security Solutions: Testing the interoperability with other security
tools like firewalls, intrusion detection/prevention systems (IDS/IPS), Security Information
and Event Management (SIEM) systems, and endpoint protection platforms (EPP). The
malware detection system must share data and insights with these tools to enhance the overall
security posture of the organization.
• Network Traffic Monitoring: Ensuring that the system can monitor and detect malware
that spreads via network traffic, such as through malicious web traffic, email attachments, or
file-sharing protocols.
5. Usability Testing:
Usability testing evaluates how easy and effective it is for administrators and users to interact
with the malware detection system. This includes:
• Ease of Configuration: The system should allow administrators to configure detection
settings, update signatures, and set scanning schedules without difficulty. Complex or overly
technical configurations can lead to mismanagement or missed threats.
• Alert and Incident Management: Testing how effectively the system communicates
threats to security personnel, including the clarity and usefulness of alerts and notifications.
The system should provide actionable insights rather than overwhelming the user with
irrelevant information.
• User Interface: Ensuring that the interface for viewing alerts, managing quarantined files,
and conducting investigations is intuitive and user-friendly. The system should also provide
reporting capabilities that are clear and useful for audit purposes.
6. Regression Testing:
Regression testing ensures that updates or changes to the malware detection system (such as
new features, bug fixes, or signature updates) do not introduce new issues or negatively affect
existing functionality. This type of testing is particularly important after software updates or
system patches.
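One simple way to carry out such a check is to store the verdicts from a known-good build and compare them with the verdicts produced after a change. The sketch below assumes verdicts are kept as a mapping from sample name to result string; find_regressions and the sample names are hypothetical.

```python
def find_regressions(baseline, current):
    """Return {sample: (expected, actual)} for every verdict that changed.

    A sample missing from `current` is reported with actual = None.
    """
    changed = {}
    for sample_id, expected in baseline.items():
        actual = current.get(sample_id)
        if actual != expected:
            changed[sample_id] = (expected, actual)
    return changed


# Hypothetical verdicts before and after a model update.
baseline = {"calc.exe": "File is safe", "dropper.exe": "File contains malware"}
current = {"calc.exe": "File is safe", "dropper.exe": "File is safe"}
for name, (expected, actual) in find_regressions(baseline, current).items():
    print(f"{name}: expected {expected!r}, got {actual!r}")
```

An empty result means the update left every recorded verdict unchanged.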
7. Stress and Load Testing:
Stress and load testing examine the system’s ability to function under extreme conditions,
such as handling a surge in network traffic or large volumes of data being scanned. This can
simulate real-world scenarios where the system might face an overload of data, such as during
a large-scale malware outbreak or Distributed Denial of Service (DDoS) attack. Stress testing
helps identify the system’s breaking point and ensures that the system can gracefully recover
from resource overloads.
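A load scenario of this kind can be approximated by submitting many samples concurrently and counting how many scans complete. The sketch below uses only the Python standard library; run_load is a hypothetical helper, and scan stands in for whatever callable submits one sample to the system under test.

```python
from concurrent.futures import ThreadPoolExecutor


def run_load(scan, payloads, workers=8):
    """Submit every payload to scan() concurrently and count the outcomes."""
    def attempt(payload):
        try:
            scan(payload)
            return True
        except Exception:
            # Any failure under load is recorded rather than aborting the run.
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(attempt, payloads))
    return {"ok": outcomes.count(True), "failed": outcomes.count(False)}


def fake_scan(payload):
    """Stand-in scanner for illustration: rejects empty payloads."""
    if not payload:
        raise ValueError("empty payload")


print(run_load(fake_scan, [b"x"] * 20 + [b""] * 5, workers=4))
```

Raising the number of workers and payloads step by step is one way to look for the breaking point mentioned above.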
8. End-to-End Testing:
End-to-end testing evaluates the overall functionality of the malware detection system from
start to finish, simulating real-world attack scenarios and testing the entire process, from
malware detection to incident response. This involves testing the workflow of how threats
are detected, quarantined, remediated, and reported, ensuring that all steps are executed as
expected.
In summary, system testing of a malware detection system is a critical part of ensuring that the
solution is effective, reliable, and secure. A thorough testing process covers everything from
core detection capabilities to performance, security, integration, and usability. By addressing
all these aspects, organizations can ensure that their malware detection system provides
comprehensive protection against evolving threats while maintaining operational efficiency
and security.
TEST DATA
• Malware samples (various types and strains)
• Benign files (different formats and sizes)
• System logs and network traffic captures
TEST ENVIRONMENT
• Virtual machines or sandbox environments for testing
• Different operating systems and software configurations
• Network simulation tools for testing network-based detection
TESTING TOOLS
• Automated testing frameworks (e.g., pytest, unittest)
• Performance testing tools (e.g., Apache JMeter, Gatling)
• Security testing tools (e.g., Metasploit, Burp Suite)
• Debugging tools (e.g., GDB, Valgrind)
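As an illustration of the automated frameworks listed above, the following sketch shows what a unittest test case for an entropy helper might look like. The shannon_entropy function is a compact stand-in written for this example; it is assumed to behave like the get_entropy function in main.py but is not copied from it.

```python
import math
import unittest
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


class ShannonEntropyTest(unittest.TestCase):
    def test_empty_input_has_zero_entropy(self):
        self.assertEqual(shannon_entropy(b""), 0.0)

    def test_constant_bytes_have_zero_entropy(self):
        self.assertEqual(shannon_entropy(b"\x00" * 64), 0.0)

    def test_uniform_bytes_reach_eight_bits(self):
        # Every byte value appears exactly once, which is the maximum for byte data.
        self.assertAlmostEqual(shannon_entropy(bytes(range(256))), 8.0)
```

Saved in a file such as test_entropy.py (a hypothetical name), the cases can be run with python -m unittest.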
TESTING SCHEDULE
• Develop a testing schedule to ensure thorough testing.
• Allocate sufficient time for each test case and test cycle.
• Plan for iterative testing and continuous improvement.
Test Data:
1. Malware Samples: Collect a diverse set of malware samples, including:
a. Viruses
b. Trojans
c. Ransomware
d. Spyware
e. Adware
2. Benign Files: Collect a large dataset of benign files, including:
a. Executable files
b. Document files
c. Image files
d. Audio files
3. Test Environments: Set up test environments to simulate real-world scenarios, including:
• Windows and Linux operating systems
• Different network configurations
10. SCREENSHOTS
The application scans .exe files to identify potential malware. Its layout is minimalistic
and user-friendly. The desktop taskbar visible in the screenshot shows that the application is
running on a Windows operating system.
Fig 9: Malware defender result
This image displays the interface of the "Malware Defender" application, showcasing its
functionality after scanning a file. The interface has a simple design, with a prominent title at
the top reading "Malware Defender." Below the title is a brief explanation of the application's
purpose, stating that it scans .exe files to determine if they contain malware.
A green "Scan" button is visible, indicating the process has been initiated. The scan results
are displayed below, providing technical details about the file, such as DllCharacteristics,
Machine, MajorSubsystemVersion, and other parameters that describe the file's attributes.
Finally, the malware detection result is clearly shown at the bottom, indicating that the file
is safe. The background features a hexagonal pattern with a gradient of blue and red tones,
adding a modern and sleek aesthetic to the application interface.
11. CONCLUSION
To address the gaps in existing models in the industry, we have proposed an efficient
system that combines several individually powerful technologies into a sustainable and
efficient method of scanning for and detecting malware on a Windows system and deriving
meaningful insights from the results. The proposed model is tailored to handle various kinds
of Windows PE malware and to detect them as accurately as possible. The system is intended
to reduce the false positive and false negative rates, produce an effective result, and alert
on the spot if any of the files are malicious. The backbone of our system is the Jupyter
Notebook and the numerous tools it provides. It also employs a few open-source tools and a
cloud storage solution to efficiently store and manage data. With this system, we aim to
revolutionise malware detection and lead towards a safe and secure future.
12. FUTURE SCOPE