Automated Malware Analysis Update

A project work or thesis on Automated Malware Analysis

Uploaded by

Chukwu Victor
AUTOMATED MALWARE DETECTION SYSTEM

BY

KRAMA GIDEON OZUZHIM

U2019/5570122

DECLARATION

I hereby declare that this project titled “Automated Malware Detection System” is the
result of my own research and work, carried out as part of my academic requirements
under the supervision of Dr. Ugochi Okengwu. This project was developed with integrity
and adherence to academic and professional standards.
All sources of information, references, and data used have been duly acknowledged. This
work has not been submitted in any previous application for a degree or qualification, nor
has it been presented elsewhere for academic credit.
I affirm that this project represents my personal contributions to the field of
cybersecurity, and I am responsible for the work presented within.

……..……………………… ………………………………
DR. UGOCHI OKENGWU DATE
(PROJECT SUPERVISOR)

Contents
ACKNOWLEDGEMENT
ABSTRACT
DEDICATION
CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
1.2 Statement of Problem
1.3 Significance of the Study
1.4 Aim and Objectives
1.5 Definition of Terms
1.6 Limitations of Study
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
2.2 The Evolution of Malware and the Need for Automated Analysis
2.3 Techniques for Automated Malware Analysis
2.4 Challenges and Future Directions
2.5 Overview of Malware Detection Techniques
2.7 Review of Related Literature
CHAPTER THREE
ANALYSIS AND DESIGN
3.1 Analysis of Existing System
3.2 Advantages of Existing System
3.3 Analysis of Proposed System
3.4 Advantages of Proposed System
3.5 Design of Proposed System
3.5.1 Methodology of Proposed System
3.5.2 Flowchart of Proposed System
3.5.3 Use Case Diagram of Proposed System
3.5.4 Database Structure of the Proposed System
3.5.5 Architecture of Proposed System
CHAPTER FOUR
IMPLEMENTATION
4.1 Hardware requirement
4.2 Software requirement
4.3 Output
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.1 Summary
5.2 Conclusion
5.3 Recommendations
5.4 Contribution to Knowledge
REFERENCES
Khan, M. H., & Khan, I. R. (2017). Malware Detection and Analysis. International
Journal of Advanced Research in Computer Science, 8(5), 1147-1149. Retrieved from
https://www.ijarcs.info
Kleber, D., & Rios, D. (2019). Automated Malware Analysis: A Python Approach.
Python Security Handbook.
Talukder, S. (2020). Tools and Techniques for Malware Detection and Analysis.
Retrieved from
https://www.researchgate.net/publication/339301928_Tools_and_Techniques_for_Malware_Detection_and_Analysis
APPENDIX A: SOURCE CODE
APPENDIX B: SCREENSHOT OF OUTPUT

CERTIFICATION

This is to certify that the project titled “Automated Malware Detection System” was
carried out by Krama Gideon Ozuzhim, with matriculation number U2019/5570122, in partial
fulfillment of the requirements for the award of Bachelor of Science in Computer Science
at the University of Port Harcourt.

This work is original and has been conducted under the supervision of Dr. Ugochi
Okengwu. It adheres to the academic standards and guidelines of the institution.

DR. UGOCHI OKENGWU ……………… ….………..


(PROJECT SUPERVISOR) SIGNATURE DATE

………………………………….. ………………. ….…….……

EXTERNAL EXAMINER SIGNATURE DATE

ACKNOWLEDGEMENT
I begin by giving thanks to God Almighty, whose guidance, wisdom, and unwavering
support have seen me through every challenge of this project. His grace has been my
source of strength, and I am deeply grateful for His presence in my life.
I would also like to express my heartfelt gratitude to my family for their constant love,
encouragement, and belief in me.
My sincere thanks go to my Head of Department, Dr. Mrs. Ugochi Okengwu, whose
expert guidance and support throughout the duration of this project have been invaluable.
Your dedication to excellence and commitment to nurturing students like me have made a
lasting impact in my life.
To everyone who has played a role in this project, whether directly or indirectly, I offer
my deepest appreciation. Your contributions have made this achievement possible.

ABSTRACT
This project presents an automated malware detection system designed to improve the
accuracy and speed of identifying malicious files. The system integrates three distinct
methods (static analysis, dynamic analysis, and machine learning) to offer a robust,
multi-layered approach to malware detection. Static analysis is used to examine
executable files for specific attributes such as headers and hashes, allowing quick
identification of known threats. Dynamic analysis complements this by monitoring the
runtime behavior of executables to detect suspicious activities in real time.
To further enhance detection capabilities, a Random Forest machine learning model was
trained on a dataset of benign and malicious samples. This model helps the system
rapidly identify previously unseen malware by learning patterns associated with
malicious behaviors. The combination of these techniques provides a more
comprehensive detection system that balances speed and accuracy, making it adaptable to
evolving cyber threats. This project demonstrates the effectiveness of a multi-faceted
approach in enhancing cybersecurity defenses and aims to contribute to the advancement
of automated malware detection technology.

DEDICATION
I dedicate this project to God, for His infinite wisdom, guidance, and strength, which
have carried me through every challenge and success along this journey. I also dedicate
this work to the next generation of cybersecurity specialists, whose passion and
commitment will continue to shape the future of digital security. May this project inspire
them to innovate, protect, and safeguard the digital world, advancing the fight against
cyber threats for a safer tomorrow.

CHAPTER ONE

INTRODUCTION
1.1 Background to the Study
The exponential growth in internet usage and software applications has simultaneously
increased the risk of cyber threats. Among these threats, malware poses a substantial risk,
as it is designed to damage, disrupt, or gain unauthorized access to computer systems.
Traditional antivirus programs detect malware by using predefined signature databases,
but the continuous emergence of new and obfuscated malware variants has rendered these
methods less effective. This study introduces an automated malware detection system that
integrates static and dynamic analysis with machine learning to improve the accuracy and
efficiency of malware identification. Traditional malware detection methods often rely on
manual inspection, where cybersecurity professionals analyze files for signs of malicious
activity. While effective in some cases, manual analysis is time-consuming and prone to
human error. It also struggles to keep pace with the rapidly evolving techniques used by
modern malware to avoid detection (Namanya et al., 2015). This has created an urgent
need for faster, more reliable ways to detect and respond to malware.

To address these challenges, automated malware analysis systems have been developed.
These systems use advanced algorithms and machine learning techniques to analyze
malicious files automatically, without the need for human intervention (Kleber & Rios,
2019). By automating the detection process, these tools can rapidly identify and respond
to malware, reducing the time it takes to mitigate threats and minimizing the risk of
damage to systems (Cuckoo Sandbox Project, 2021). Automated systems can also
analyze large volumes of data, improving overall security by catching threats that may be
missed during manual inspections.

The rise of automated malware analysis systems is transforming the field of
cybersecurity, enabling organizations to protect their networks more effectively and
efficiently. However, there is still a need to adapt these systems to specific regional
contexts, like Nigeria, where unique cybersecurity challenges exist. As more threats
emerge, the continuous development of these automated tools remains a crucial aspect of
defending against evolving cyberattacks.

1.2 Statement of Problem


The continuous evolution of malware poses significant challenges to traditional detection
methods, primarily those relying on static signature-based approaches. Signature-based
systems are effective at identifying known malware by matching file characteristics
against a predefined database. However, these systems are inherently limited when faced
with new, unknown, or polymorphic malware that changes its signature with each
infection, effectively evading detection.
Additionally, traditional antivirus solutions often lack the ability to analyze the behavior
of executables dynamically, limiting their effectiveness against malware that appears
benign but executes malicious actions only at runtime. Without a deeper inspection of
executable files and their behavior, these detection systems may fail to identify
sophisticated malware, exposing users and organizations to substantial security risks.
This project addresses these challenges by developing a multi-layered malware detection
system that integrates static and dynamic analysis with a machine learning classifier. The
static analysis inspects the structure and features of an executable, the dynamic analysis
submits files to VirusTotal for real-time threat assessment, and the machine learning
model classifies unknown files based on learned patterns from malicious and benign
samples. Together, these approaches create a more robust detection system capable of
identifying both known and unknown malware, thus filling the gaps left by conventional
detection methods.

1.3 Significance of the Study
1. Improved Malware Detection Accuracy
Traditional malware detection methods, such as signature-based techniques, are
becoming increasingly ineffective due to the constant evolution of malware. By
combining static analysis, dynamic analysis, and machine learning, this study
provides a more accurate and comprehensive approach. Static analysis extracts file
features without execution, dynamic analysis observes behavior during runtime, and
machine learning enables the system to learn from historical data, identifying new and
previously unseen malware variants. This multi-layered approach significantly
improves detection accuracy compared to traditional methods.
2. Adaptability to Evolving Malware
One of the major challenges in malware detection is the continuous development of
new strains. Traditional methods struggle to detect modified or obfuscated malware.
The proposed system adapts to emerging threats by analyzing files from multiple
perspectives—structural (static) and behavioral (dynamic)—while the machine
learning component learns from evolving data. This adaptability allows the system to
maintain high detection rates even as malware changes.
3. Efficiency in Malware Identification
By automating the malware detection process, this study offers a more efficient
alternative to manual inspection. The use of automated static and dynamic analysis
significantly reduces the time required to identify malicious files. The machine
learning model, once trained, can classify new files in a fraction of the time it would
take for manual methods or traditional signature-based systems to process large
volumes of data. This efficiency makes the system a valuable tool for real-time
malware detection in both individual and enterprise-level security operations.
4. Contribution to Cybersecurity Research
This study contributes to the ongoing research in the field of cybersecurity by
proposing an innovative hybrid approach to malware detection. By integrating
multiple detection techniques, it pushes the boundaries of traditional detection
methods, highlighting the potential of combining behavioral analysis and machine
learning for more robust security systems. The findings of this study may inspire
future research and development of next-generation malware detection tools.
1.4 Aim and Objectives

The aim of this project is to design an automated malware detection system. The
objectives are to:

I. Gather the datasets from MalwareBazaar.
II. Train and test a Random Forest algorithm to act as the malware classifier.
III. Use the pefile library to extract essential PE file details, such as entry
points and sections, for effective static analysis.
IV. Use hashlib to compute MD5 and SHA256 hashes to verify file integrity and
cross-check files against known malware signatures.
V. Use the requests and json libraries to integrate with the VirusTotal API for
dynamic analysis and to retrieve detailed reports.
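
Objective IV can be illustrated with a minimal sketch. The function names `file_hashes` and `matches_known_signature`, the chunk size, and the idea of holding known hashes in a Python set are assumptions made for illustration, not the project's actual source code.

```python
import hashlib


def file_hashes(path, chunk_size=65536):
    """Return the MD5 and SHA256 hex digests of the file at `path`."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so a large executable is never held in memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}


def matches_known_signature(hashes, known_hashes):
    """Return True if either digest appears in a set of known-malware hashes."""
    # `known_hashes` is a hypothetical set of hex digests loaded from a signature source.
    return hashes["md5"] in known_hashes or hashes["sha256"] in known_hashes
```

A hash match only identifies files that are byte-for-byte identical to a known sample, which is why the other objectives add structural and behavioral checks.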

1.5 Definition of Terms


1. Malware
Malware, or "malicious software," is any software specifically designed to disrupt,
damage, or gain unauthorized access to systems. It includes a variety of harmful
programs, such as viruses, worms, trojans, ransomware, and spyware, each with unique
behaviors and attack strategies. Malware detection systems aim to prevent, identify, and
remove these threats, which pose risks to data security, privacy, and overall system
integrity.
2. Malware Detection
Malware detection refers to the methods and processes used to identify and mitigate
malware. Effective detection involves both identifying known threats and discovering
new ones. Detection techniques vary, from traditional methods like signature-based
detection to modern approaches involving behavior-based and machine learning methods.
The goal of malware detection is to minimize the impact of malicious software by
identifying it early and accurately.
3. Static Analysis
Static analysis is the examination of files or software without executing them. In malware
detection, static analysis inspects a file's structure—such as its code, metadata, imports,
and sections—searching for indicators of malicious behavior. Key techniques include
hash calculation, header analysis, and entropy calculation. Static analysis is fast and safe
since it doesn’t involve running potentially dangerous code. However, it has limitations,
particularly against obfuscation techniques that can disguise malware within files.
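
The entropy calculation mentioned above can be sketched as follows. This is a minimal illustration of Shannon entropy over the bytes of a file; the function name `shannon_entropy` is an assumption. Values close to 8 bits per byte often point to packed or encrypted content, although high entropy alone does not prove a file is malicious.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Return the Shannon entropy of a bytes object in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Count how often each byte value occurs, then sum -p * log2(p) over those values.
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```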
4. Dynamic Analysis
Unlike static analysis, dynamic analysis involves executing the file in a controlled
environment (e.g., sandbox or virtual machine) to observe its behavior. This method is
useful for detecting behaviors like unauthorized file changes, network connections, or
interactions with system processes that may indicate malware. Services like VirusTotal
facilitate dynamic analysis by providing access to virtual environments where malware
can be safely observed. Although dynamic analysis is highly effective for behavior-based
detection, it requires significant resources and time, making it less practical for large-
scale real-time detection.
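
As a hedged illustration of how a VirusTotal report might be requested and summarized, the sketch below looks up an existing report by file hash through the public v3 API. It uses the standard-library `urllib` in place of the `requests` library so that it runs without extra packages. The function names and the choice to count "malicious" and "suspicious" verdicts together are assumptions, and uploading a new file (a different endpoint) is not shown.

```python
import json
import urllib.request

VT_FILE_URL = "https://www.virustotal.com/api/v3/files/{file_hash}"


def build_report_request(file_hash, api_key):
    """Build (but do not send) a GET request for an existing file report."""
    url = VT_FILE_URL.format(file_hash=file_hash)
    # The v3 API expects the key in the x-apikey header.
    return urllib.request.Request(url, headers={"x-apikey": api_key})


def summarize_report(report):
    """Count engine verdicts in a parsed report dictionary."""
    stats = report["data"]["attributes"]["last_analysis_stats"]
    # Treating "suspicious" as flagged is an illustrative choice, not a fixed rule.
    flagged = stats.get("malicious", 0) + stats.get("suspicious", 0)
    return {"flagged": flagged, "total": sum(stats.values())}


def fetch_report(file_hash, api_key):
    """Send the request and return the parsed JSON report (needs network access)."""
    with urllib.request.urlopen(build_report_request(file_hash, api_key)) as response:
        return json.load(response)
```

Keeping the request builder and the summary separate from the network call means the parsing logic can be checked against a saved report without spending API quota.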
5. Machine Learning
Machine learning (ML) is a branch of artificial intelligence that enables systems to learn
from data, improve over time, and make predictions or decisions without being explicitly
programmed. In malware detection, machine learning can analyze patterns in data to
classify files as benign or malicious. Machine learning models are trained on large
datasets, learning features that distinguish malware from legitimate files. By recognizing
these patterns, ML models can detect new malware variants, making this approach more
robust against evolving threats compared to traditional methods.
6. Feature Extraction

Feature extraction is the process of identifying and isolating specific attributes of a file
that are useful for distinguishing between malware and benign software. Features might
include file size, entropy, PE header information, and imported functions. In a machine
learning-based malware detection system, feature extraction is critical because it provides
the model with data points that it uses to detect malware patterns. Effective feature
selection and extraction improve model accuracy and efficiency, helping it to generalize
across different types of malware.
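
A minimal sketch of extracting PE header features is given below. The project itself relies on the pefile library; this example instead reads the same kind of fields directly with the standard-library `struct` module so that it runs without third-party packages. The function name `pe_header_features` and the particular fields returned are assumptions for illustration.

```python
import struct


def pe_header_features(data):
    """Extract a few numeric features from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # The 4-byte field at offset 0x3C (e_lfanew) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    machine, section_count, timestamp, _, _, optional_size, _ = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4
    )
    entry_point = None
    if optional_size >= 20:
        # AddressOfEntryPoint sits 16 bytes into the optional header.
        (entry_point,) = struct.unpack_from("<I", data, pe_offset + 24 + 16)
    return {
        "file_size": len(data),
        "machine": machine,
        "number_of_sections": section_count,
        "timestamp": timestamp,
        "entry_point": entry_point,
    }
```

The resulting dictionary can be turned into a fixed-order list of numbers and passed to a classifier as one row of training or test data.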
7. False Positive and False Negative
False positives and false negatives are important metrics in malware detection that reflect
the accuracy and reliability of the system:
False Positive: Occurs when a benign file is incorrectly classified as malware. High
false-positive rates can reduce trust in the detection system and lead to unnecessary
resource usage.
False Negative: Occurs when a malicious file is misclassified as benign, allowing it to
evade detection. False negatives are particularly dangerous because they allow malware
to bypass security measures and cause harm to the system.
Reducing both false positives and false negatives is crucial.
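
The two error types can be measured as rates once a set of files with known labels has been classified. The sketch below assumes labels are encoded as 1 for malware and 0 for benign; the function name `error_rates` is hypothetical.

```python
def error_rates(true_labels, predicted_labels):
    """Return false-positive and false-negative rates (label 1 = malware, 0 = benign)."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 0 and guess == 0:
            tn += 1
        elif truth == 0 and guess == 1:
            fp += 1  # benign file flagged as malware
        else:
            fn += 1  # malware missed by the detector
    # Guard against division by zero when one class is absent from the test set.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"false_positive_rate": fpr, "false_negative_rate": fnr}
```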

1.6 Limitations of Study


1. Variability in Malware

Malware is constantly evolving, with new and more sophisticated versions being created
regularly. These new variants might have different characteristics or use advanced
techniques to avoid detection. As a result, the system developed for detecting malware
may struggle to identify these novel threats effectively. In other words, if the malware is
too new or too cleverly designed, it might slip through the system's defenses.

2. Internet and power supply

Infrastructural challenges, such as inconsistent internet access and power supply,
impacted the consistency and speed of my research.

3. False Positives and Negatives

Automated malware detection systems are not perfect and can sometimes make errors.
Two main types of error were observed:

False Positives: This happens when the system incorrectly identifies a legitimate,
harmless file as malicious. This can lead to unnecessary alerts or actions being taken
against non-threatening files.

False Negatives: This occurs when the system fails to detect an actual malicious file,
allowing it to go unnoticed. This means a real threat could bypass the system
undetected, potentially causing harm.

CHAPTER TWO

LITERATURE REVIEW
2.1 Introduction
The increasing sophistication and volume of malware attacks pose significant challenges
to cybersecurity. Traditional methods of malware detection, relying heavily on signature-
based approaches and manual analysis, are no longer sufficient to address the evolving
landscape of cyber threats. Consequently, automated malware analysis, particularly
through the development of Python programs, has gained traction as a method to enhance
threat detection and mitigation efforts. This literature review explores the existing
research on automated malware analysis, focusing on Python-based methods for
identifying malicious executables.

2.2 The Evolution of Malware and the Need for Automated Analysis
Malware, short for malicious software, encompasses a variety of harmful software types,
including viruses, worms, trojans, and ransomware. The rapid evolution of malware has
outpaced traditional detection methods. According to Vinod et al. (2009), signature-based
detection, while effective against known threats, struggles with detecting new and
polymorphic malware variants. The need for automated, behavior-based detection
methods has thus become paramount.

Python as a Tool for Malware Analysis

Python's popularity in cybersecurity is due to its simplicity, extensive libraries, and
versatility, making it an ideal choice for developing malware analysis tools. Zeltser
(2015) emphasized that Python’s libraries, such as pefile for parsing
Portable Executable (PE) files and scikit-learn for implementing machine learning
models, provide powerful tools for automating malware analysis. These libraries enable
researchers and security professionals to develop programs that can efficiently identify
and categorize malicious executables.

2.3 Techniques for Automated Malware Analysis
Automated malware analysis can be broadly categorized into static and dynamic analysis
techniques. Static analysis involves examining the executable without running it,
focusing on aspects such as file structure, strings, and binary patterns. Dinaburg et al.
(2008) highlighted the effectiveness of static analysis in identifying malware
characteristics by analyzing code signatures, while Faruki et al. (2014) discussed the
limitations of static analysis in detecting obfuscated or packed malware.

Dynamic analysis, on the other hand, involves executing the malware in a controlled
environment (sandbox) to observe its behavior. Research by Egele et al. (2008)
demonstrated that dynamic analysis can effectively identify malicious behavior patterns,
such as network activity and system modifications, which are often missed by static
methods. Python-based tools like Cuckoo Sandbox have been instrumental in advancing
dynamic analysis techniques by automating the observation and recording of malware
behavior in real time.

Machine Learning and Malware Detection

Recent studies have focused on incorporating machine learning (ML) techniques into
automated malware analysis. Anderson and Roth (2018) explored the application of
machine learning models, such as decision trees and neural networks, to classify
executables as malicious or benign based on features extracted from both static and
dynamic analysis. Their study demonstrated that ML models could significantly enhance
detection accuracy, particularly when trained on large datasets of labeled malware
samples.

Raff et al. (2020) further advanced this field by introducing deep learning techniques for
malware detection. Their research showed that deep neural networks could automatically
learn complex features from raw binary data, reducing the need for manual feature
engineering. Python's ML libraries, including TensorFlow and PyTorch, have facilitated
these advancements by providing robust frameworks for implementing and
experimenting with deep learning models.

2.4 Challenges and Future Directions


Despite the progress made in automated malware analysis, several challenges remain.
The ever-evolving nature of malware, particularly with techniques like code obfuscation
and polymorphism, poses significant hurdles for static analysis methods. Dynamic
analysis, while effective, is resource-intensive and may not scale well for real-time
detection in large networks.

Moreover, the reliance on labeled datasets for training machine learning models presents
a challenge, as labeling large volumes of malware samples is time-consuming and
requires expert knowledge. Future research should focus on developing unsupervised or
semi-supervised learning techniques that can detect novel malware without extensive
labeled datasets.

2.5 Overview of Malware Detection Techniques


Malware detection techniques are generally classified into three broad categories:
signature-based detection, anomaly-based detection, and hybrid methods.

1. Signature-Based Detection: This is the most traditional approach, where known
patterns (signatures) of malware are stored in a database. Tools like ClamAV use
signature databases to scan files for matches. However, as noted by Vinod et al.
(2009), this method struggles to detect new, unknown, or polymorphic malware,
which can change its code to avoid detection.

2. Anomaly-Based Detection: Anomaly-based methods detect deviations from
normal behavior. These techniques are particularly effective in identifying
zero-day attacks, as they do not rely on known signatures. According to Denning
(1987), anomaly detection is grounded in statistical models, and machine learning
(ML) plays a significant role in modern implementations. Python’s ML libraries,
such as Scikit-learn and TensorFlow, have been instrumental in developing
models that learn from benign and malicious behavior to identify anomalies.

3. Hybrid Methods: Hybrid approaches combine signature-based and
anomaly-based methods, leveraging the strengths of both. For example, Santos et al.
(2011) proposed a hybrid framework where initial filtering is done using
signature-based methods, followed by deeper analysis through anomaly detection and
machine learning models. Python’s versatility allows for seamless integration of
these techniques, enabling the development of robust and comprehensive malware
detection systems.
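
The two-stage idea behind hybrid methods can be sketched as below: a cheap signature lookup runs first, and only files without a match are passed to a scoring stage. This is an illustrative outline, not the framework of any cited work. The function name `hybrid_verdict`, the `anomaly_score` callable, the 0.5 threshold, and the verdict strings are all assumptions.

```python
import hashlib


def hybrid_verdict(data, known_bad_hashes, anomaly_score, threshold=0.5):
    """Classify raw file bytes with a signature check followed by a scoring stage."""
    # Stage 1: exact-match lookup against known malware hashes (fast, known threats only).
    digest = hashlib.sha256(data).hexdigest()
    if digest in known_bad_hashes:
        return "malicious (signature match)"
    # Stage 2: `anomaly_score` is any callable returning a score in [0, 1],
    # for example a wrapper around a trained model's probability output.
    if anomaly_score(data) >= threshold:
        return "suspicious (anomaly score)"
    return "no detection"
```

Ordering the stages this way means the more expensive scoring step is skipped for files that are already known to be malicious.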

Static and Dynamic Malware Analysis

Automated malware analysis can be performed using static, dynamic, or hybrid
approaches, each with its own set of tools and methodologies.

1. Static Analysis: Static analysis involves examining the binary code of an
executable without executing it. This method includes analyzing the file’s
structure, strings, and embedded resources. Tools like pefile (a Python module)
allow for the parsing and inspection of PE files, which are common formats for
Windows executables. According to You and Yim (2010), static analysis is
efficient for detecting known malware but is limited by its inability to handle
obfuscated or packed malware.

2. Dynamic Analysis: In contrast, dynamic analysis observes the behavior of
malware during execution in a controlled environment, such as a sandbox. Cuckoo
Sandbox, an open-source Python-based tool, has been widely adopted for dynamic
malware analysis. As Egele et al. (2008) discussed, dynamic analysis is effective in
uncovering malicious behavior that might be hidden through static obfuscation
techniques. However, it is resource-intensive and requires a safe environment to
prevent actual system infection.

3. Hybrid Analysis: Hybrid analysis combines static and dynamic methods to
provide a more comprehensive evaluation of a potential threat. Research by
Anderson et al. (2011) demonstrated that combining the strengths of both approaches
leads to higher detection rates and better accuracy in classifying malware.

Machine Learning in Malware Detection

The integration of machine learning into malware detection has revolutionized the field,
enabling more sophisticated analysis and classification of threats.

1. Feature Extraction: Feature extraction is a critical step in machine learning-based


malware detection. Static features include byte sequences, API calls, and opcodes,
while dynamic features might involve system calls, network traffic, and file
system changes. Kolter and Maloof (2006) demonstrated that carefully selected
features significantly improve the performance of machine learning models in
detecting malware. Python libraries like Scikit-learn and Pandas are essential for
feature extraction and preprocessing in these models.

2. Supervised Learning Models: Supervised learning models, including decision


trees, random forests, support vector machines (SVMs), and neural networks, are
trained on labeled datasets of benign and malicious executables. Anderson and
Roth (2018) highlighted that supervised models could achieve high accuracy,
particularly when trained on large, diverse datasets. Python’s ML libraries
facilitate the implementation of these models, enabling rapid experimentation and
iteration.

3. Unsupervised Learning Models: Unsupervised learning is used for anomaly


detection and clustering, where the model identifies patterns in data without
labeled examples. Alazab . (2013) explored the use of unsupervised learning for
detecting novel malware variants by clustering similar malicious behaviors.

Page | 20
Python’s versatility allows researchers to implement a range of unsupervised
algorithms, such as k-means clustering, for these purposes.

4. Deep Learning Models: Deep learning models, particularly convolutional neural
networks (CNNs) and recurrent neural networks (RNNs), have shown promise in
malware detection. Raff et al. (2020) introduced a deep learning approach where raw
binary data from executables is fed directly into the model, bypassing the need for
manual feature extraction. This method leverages Python’s deep learning libraries
like TensorFlow and PyTorch, demonstrating high accuracy and the ability to
generalize across different types of malware.
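
As a minimal sketch of the static feature extraction described in item 1, the snippet below turns a file's raw bytes into a normalized 256-bin byte histogram, one of the simplest byte-level feature vectors. The function name `byte_histogram` and the stand-in sample bytes are illustrative assumptions and are not part of the system described in this report.

```python
from collections import Counter
from typing import List


def byte_histogram(data: bytes) -> List[float]:
    """Return a 256-element vector of normalized byte frequencies."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


# Stand-in bytes for illustration only; this is not a real executable.
sample = b"MZ\x90\x00" * 4
features = byte_histogram(sample)
print(len(features), features[0x4D])
```

A vector of this kind could be passed to any of the supervised or unsupervised models listed above; in practice it would usually be combined with richer features such as imported API names.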

Challenges in Automated Malware Analysis

Despite the advancements in automated malware analysis, several challenges persist:

1. Evasion Techniques: Malware developers continuously evolve evasion
techniques, such as polymorphism, metamorphism, and sandbox detection, to
bypass static and dynamic analysis. A study by Christodorescu et al. (2005)
emphasized the need for continuous updates to detection methods to keep pace
with these advancements. Python's adaptability allows for quick updates to
analysis tools, but the cat-and-mouse game between malware developers and
defenders remains ongoing.

2. Scalability: Scaling automated malware analysis to handle large volumes of files
in real-time is challenging. Dynamic analysis, in particular, is resource-intensive
and may not be feasible for high-throughput environments. However, hybrid
approaches and cloud-based solutions offer potential paths forward, as discussed
by Bayer et al. (2009).

3. Labeling and Dataset Availability: The effectiveness of machine learning
models depends heavily on the quality and quantity of labeled datasets. However,
labeling large datasets of malware samples is time-consuming and requires expert
knowledge. To address this, Bilge and Dumitras (2012) suggested leveraging
semi-supervised learning techniques that use both labeled and unlabeled data,
reducing the dependency on fully labeled datasets.

4. Interpretability of Machine Learning Models: As machine learning models,
particularly deep learning models, become more complex, their interpretability
decreases. Understanding why a model classifies an executable as malicious or
benign is crucial for trust and accountability in cybersecurity applications.
Ribeiro et al. (2016) proposed using model-agnostic techniques to improve
interpretability, a research area that continues to evolve.

Future Directions

The future of automated malware analysis lies in addressing the challenges identified and
enhancing current methodologies. Research should focus on:

1. Increasing Evasion Resistance: The current system performs static and dynamic
analysis to identify malicious files, but sophisticated malware can evade detection by
using advanced obfuscation or encryption methods. Future work should focus on
enhancing the static analysis component to detect obfuscation more reliably, such as
by improving entropy-based analysis and incorporating additional PE header
attributes. Additionally, expanding the machine learning model to include features
derived from both static and dynamic analyses could help the system learn more
complex patterns of evasion.
2. Enhancing Model Interpretability and Reducing Data Dependency: The current
machine learning model is trained on labeled datasets, but acquiring and labeling large
datasets can be challenging. Future research could explore feature reduction or
advanced transfer learning to generalize from smaller datasets. Additionally, model
interpretability tools could be integrated to understand which features most influence
predictions, aiding in transparency and debugging, especially when dealing with
high-stakes malware detection.
3. Refining Hybrid Detection Approaches: The combination of static, dynamic, and
machine learning-based approaches in this system is an effective hybrid model. Future
enhancements could involve implementing a more seamless integration between static
and dynamic features, allowing the model to consider behavioral indicators alongside
structural attributes. This hybrid approach would make the system more robust,
particularly in identifying malware that exhibits suspicious behavior patterns
post-execution.
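
One possible way to refine the entropy-based analysis mentioned in item 1 is to measure entropy over fixed-size windows instead of whole sections, so that a small packed or encrypted region is not averaged away. The sketch below illustrates the idea under stated assumptions: the function names, the 256-byte window, and the 7.0 bits-per-byte threshold are all hypothetical choices and are not taken from the implemented system.

```python
import math
from collections import Counter
from typing import List, Tuple


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def windowed_entropy(data: bytes, window: int = 256, step: int = 256) -> List[Tuple[int, float]]:
    """Return (offset, entropy) pairs for consecutive windows over the data."""
    last_start = max(len(data) - window, 0)
    return [
        (offset, shannon_entropy(data[offset:offset + window]))
        for offset in range(0, last_start + 1, step)
    ]


def high_entropy_windows(data: bytes, threshold: float = 7.0,
                         window: int = 256, step: int = 256) -> List[int]:
    """Offsets of windows whose entropy meets the (assumed) threshold."""
    return [off for off, h in windowed_entropy(data, window, step) if h >= threshold]


# Synthetic input: 256 zero bytes followed by one of every byte value.
data = bytes(256) + bytes(range(256))
print(high_entropy_windows(data))
```

Here only the second window is flagged, because uniformly distributed bytes reach the maximum of 8 bits per byte while a run of zeros has zero entropy. Any real threshold would need to be tuned on labeled samples.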

2.7 Review of Related Literature


1. Malware Detection and Analysis

Functionality: The paper by Khan and Khan (2017) provides an in-depth exploration of
various techniques for malware detection and analysis, focusing on both static and
dynamic analysis methods. Static analysis involves inspecting the code without executing
it, while dynamic analysis involves running the malware in a controlled environment to
observe its behavior. The study introduces a framework designed to simplify malware
detection by allowing users to analyze binary files using Python scripts. The framework
enables users to scan through complex code to extract valuable information about the
malware’s structure and behavior.

Problem Solved: This paper addresses the growing need for effective tools to detect and
analyze increasingly sophisticated malware. By leveraging both static and dynamic
analysis techniques, the framework proposed in the paper helps users understand the
potential impact of malware on a system and develop appropriate signatures for detecting
malware infections on networks. The use of Python in the framework allows for
automation and simplification of the malware analysis process, making it more accessible
to users with varying levels of technical expertise.

Identified Gap: While the framework presented in the paper is comprehensive, it does
not specifically address the challenges faced in regions with limited cybersecurity
infrastructure, such as Nigeria. Additionally, the framework's reliance on general
malware signatures may not be fully effective in detecting region-specific threats that are
prevalent in countries like Nigeria. There is also a lack of emphasis on integrating
localized datasets into the analysis process, which is crucial for improving detection
accuracy in specific regions.

Proposed Solution:

My project can address these gaps by developing an automated malware detection tool
tailored for the Nigerian context. This tool would combine both static and dynamic
analysis methods, optimized for low-resource environments. By integrating a localized
dataset of Nigerian malware samples, the tool can improve detection accuracy and
relevance, making it more effective in addressing the unique cybersecurity challenges
faced in Nigeria.

CHAPTER THREE

ANALYSIS AND DESIGN


3.1 Analysis of Existing System
Traditional malware detection systems on Windows OS primarily rely on signature-based
detection, where known malware signatures are matched against a database. While
effective against known threats, this method struggles with new, unknown, or
polymorphic malware that alters its signature to avoid detection.
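
The weakness of exact signature matching can be illustrated with a short sketch. It assumes the simplest form of signature, a SHA-256 hash of the whole file, and uses made-up byte strings in place of real samples; the names `known_signatures` and `is_known` are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Made-up byte strings standing in for a known sample and a trivially altered variant.
original = b"example payload bytes"
variant = original + b"\x00"  # a single appended byte changes the hash entirely

# Hypothetical signature database containing only the original sample's hash.
known_signatures = {sha256_of(original)}


def is_known(data: bytes) -> bool:
    """Return True if the data's hash appears in the signature database."""
    return sha256_of(data) in known_signatures


print(is_known(original), is_known(variant))
```

The variant is functionally almost identical but no longer matches, which is why hash-based signatures alone cannot keep up with polymorphic samples.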

Current Systems:

1. Antivirus Software: Relies on signature-based detection, heuristic analysis, and
sometimes sandboxing to catch malicious activity.

2. Intrusion Detection Systems (IDS): Often network-based, these systems monitor
network traffic for suspicious activity, but they may not catch malware that acts
locally on a machine.

3. Heuristic Analysis: Some antivirus programs analyze the behavior of programs in
a controlled environment to detect potential threats. However, this can lead to false
positives.

Limitations:

1. Slow Response to New Threats: Signature updates are required to detect new
malware, leading to a window of vulnerability.

2. Resource Intensive: Heuristic analysis and sandboxing can consume significant
system resources.

3. False Positives/Negatives: Legitimate programs can be flagged as malware, while
sophisticated malware can evade detection.

3.2 Advantages of Existing System
1. Wide Adoption: Antivirus software is widely used, making it a first line of
defense for many users.

2. Extensive Databases: Signature-based systems have large databases of known
malware, enabling quick identification of many threats.

3. Real-Time Protection: Many existing systems offer real-time scanning and
protection, catching threats as they appear.

4. User-Friendly: Most traditional antivirus solutions are easy to use, requiring
minimal user interaction.

3.3 Analysis of Proposed System
The proposed system aims to address the limitations of traditional malware detection
methods by incorporating automated analysis techniques that focus on behavioral patterns
and machine learning algorithms.

Key Features:

1. Static Analysis: Examines file attributes (such as PE header information and entropy)
without executing the file. This allows for quick analysis and identification of
suspicious attributes.
2. Dynamic Analysis (VirusTotal Integration): Submits files to VirusTotal for external
behavior-based analysis, adding a layer of behavioral insight that static analysis alone
cannot provide.
3. Machine Learning: A model (using Random Forest in this case) is trained on
extracted features to distinguish between malicious and benign files. This enables the
system to generalize patterns and detect previously unknown malware.

Challenges Addressed:
1. Detection of Previously Unknown Malware
Challenge: Traditional malware detection systems rely heavily on signature-based
detection, which can only identify known malware by matching against a database of
existing signatures. This approach is ineffective for new or modified malware with
unique signatures, as it requires prior knowledge of the threat.
Solution: The proposed system uses a machine learning model (Random Forest) trained
on features extracted from both malicious and benign files. This model can generalize
patterns in malware, enabling the system to detect previously unknown threats based on
learned patterns rather than specific signatures. This improves the system’s ability to
identify new variants of malware without the need for a signature update.

2. Resistance to Evasion Techniques
Challenge: Malware authors often employ obfuscation techniques, such as code packing,
encryption, or polymorphism, to hide malicious behavior and evade static detection
methods. These techniques make it difficult for signature-based systems to analyze and
detect such files accurately.
Solution: The proposed system incorporates static analysis, which includes calculating
entropy and examining the Portable Executable (PE) headers. High entropy scores can
indicate obfuscation or packing, which may signal the presence of malware. Additionally,
by combining this static analysis with behavior-based insights from VirusTotal and
machine learning classification, the system enhances its ability to identify malware even
when obfuscation techniques are used.
3. Reducing False Positives and Negatives
Challenge: Traditional detection systems, especially behavior-based ones, can generate
high rates of false positives (labeling benign files as malicious) and false negatives
(failing to detect actual malware). False positives can erode user trust, while false
negatives allow malicious files to go undetected, posing a security risk.
Solution: The proposed system reduces false positives and negatives by combining
multiple detection methods: static analysis, optional dynamic analysis via VirusTotal, and
machine learning classification. The integration of VirusTotal’s feedback provides
behavior-based indicators, while the machine learning model applies learned patterns to
improve classification accuracy. Together, these approaches create a more reliable and
balanced detection mechanism, enhancing overall detection accuracy and minimizing
erroneous classifications.
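
The way several detection methods can be merged into one verdict is sketched below. This is a minimal illustration rather than the project's exact logic: the function name `combine_verdicts`, the verdict strings, and the separate "Inconclusive" outcome are assumptions introduced for the example.

```python
MALWARE = "Malware"
CLEAN = "No Malware Detected"
INCOMPLETE = "Analysis Incomplete"


def combine_verdicts(dynamic_result: str, model_result: str) -> str:
    """Merge two per-method verdicts into a final verdict (hypothetical rule)."""
    results = (dynamic_result, model_result)
    if MALWARE in results:
        # Either method flagging the file is enough to report it.
        return "Malware Detected"
    if INCOMPLETE in results:
        # Neither method flagged the file, but at least one did not finish.
        return "Inconclusive"
    return "No Malware Detected"


print(combine_verdicts(MALWARE, CLEAN))
print(combine_verdicts(CLEAN, INCOMPLETE))
print(combine_verdicts(CLEAN, CLEAN))
```

A rule that flags a file when either method does so lowers false negatives but can raise false positives; requiring both methods to agree trades the other way. Which rule is appropriate depends on how costly each kind of error is in the deployment.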

3.4 Advantages of Proposed System


The proposed system offers several advantages over existing methods:

1. Increased Resilience to Evasion: The machine learning model is trained to
recognize patterns of obfuscation and other evasive techniques, making the system
more resilient to sophisticated threats.

2. Scalability: The system can be scaled to handle large volumes of data and adapt to
new types of malware.

3. Real-Time Analysis: The proposed system can analyze files in real-time,
providing immediate feedback on potential threats.

4. Lower Resource Consumption: Efficient algorithms ensure that the tool does not
overly burden the host machine's resources.

3.5 Design of Proposed System


3.5.1 Methodology of Proposed System
The methodology for the proposed system involves several key steps:

1. Data Collection: Gather a dataset of benign and malicious executables from
reliable sources. This dataset will be used to train and test the machine learning
model.

2. Feature Extraction: Analyze the collected executables to extract features that are
indicative of malicious or benign behavior. This could include API calls, file
access patterns, network activity, etc.

3. Model Training: Use the extracted features to train a machine learning model
(e.g., Random Forest). The model will learn to distinguish between malicious and
benign executables based on the features.

4. System Implementation: Develop the system to integrate the trained model,
allowing it to analyze new executables and classify them automatically.

5. Evaluation and Optimization: Test the system against a separate dataset to
evaluate its performance. Fine-tune the model and system to improve accuracy and
reduce false positives.
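
The evaluation step can be made concrete with a small sketch that derives standard metrics from confusion-matrix counts. The helper name `detection_metrics` is hypothetical, and the counts passed to it are arbitrary illustrative numbers, not results obtained from this project.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute common detection metrics from confusion-matrix counts.

    tp: malicious files flagged as malicious
    fp: benign files flagged as malicious
    tn: benign files passed as benign
    fn: malicious files passed as benign
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Arbitrary example counts for illustration only.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting the false positive rate alongside accuracy matters here because the stated optimization goal is to reduce false positives, which accuracy alone can hide on an imbalanced test set.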

3.5.2 Flowchart of Proposed System

Start -> Input Data (File / Program) -> Feature Analysis -> Classification (Legitimate
or Malicious) -> Report Generation -> End

3.5.3 Use Case Diagram of Proposed System
1. Scan File: The user initiates a scan of an executable file.

2. Static Analysis Module: Analyzes executables before execution to extract static
features.

3. Dynamic Analysis Module: Analyzes executables during runtime to extract
dynamic features.

4. Machine Learning Model: Receives features extracted from static and dynamic
analysis and classifies the sample.

5. View Report: The user views the analysis report generated by the system.

No malware Detected.

Malware Detected in dynamic analysis phase.

3.5.4 Database Structure of the Proposed System
If a database is added to the program, it would store the following:

1. Executables

2. Results

3. Models

4. Logs
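
A minimal sketch of how the four entities listed above could be laid out, using Python's built-in `sqlite3` module. The report names only the entities; every column name, type, and relationship below is an assumption made for illustration.

```python
import sqlite3

# Hypothetical schema: only the four table names come from the report.
SCHEMA = """
CREATE TABLE IF NOT EXISTS executables (
    id INTEGER PRIMARY KEY,
    file_path TEXT NOT NULL,
    sha256 TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS models (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    trained_at TEXT
);
CREATE TABLE IF NOT EXISTS results (
    id INTEGER PRIMARY KEY,
    executable_id INTEGER NOT NULL REFERENCES executables(id),
    model_id INTEGER REFERENCES models(id),
    verdict TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS logs (
    id INTEGER PRIMARY KEY,
    message TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def open_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure all tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = open_database()
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)
```

SQLite is chosen here only because it needs no server and ships with Python, which suits a command-line tool; any relational database would serve the same purpose.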

3.5.5 Architecture of Proposed System


1. Presentation Layer: The user interface, where users interact with the system. This
could be a graphical user interface (GUI) or a command-line interface (CLI). For
this project I have used CLI.

2. Business Logic Layer: This is where the core processing happens. It includes the
feature extraction, machine learning model, and analysis logic.

3. Data Layer: If a database is used, it will be managed here. This layer handles the
storage and retrieval of data such as executable features, analysis results, and logs.

CHAPTER FOUR

IMPLEMENTATION

4.1 Hardware requirement


This system doesn't require high-performance hardware. However, for smooth
performance, the following minimum specifications are recommended:
Processor: Multi-core processor (e.g., Intel Core i5 or AMD equivalent)
RAM: At least 4 GB; 8 GB is recommended for better performance.
Storage: 256 GB SSD for efficient file handling
Internet Connection: Required for API access

4.2 Software requirement


This section outlines the necessary software and libraries required to run the malware
detection script:
Operating System: Runs on any OS that supports Python (e.g., Windows, Linux, or
macOS), but can only scan Windows executable files.
Python: Version 3.6 or higher
Libraries used:
pefile: For parsing PE (Portable Executable) files.
hashlib: For computing MD5 and SHA256 hashes of files.
requests: For interacting with the API.
json: For handling JSON data from API responses.
os: To check file paths and existence.
scikit-learn: For machine learning.
time: For handling delays while waiting for API responses.

4.3 Output
The script uses print statements to display analysis results, structured in JSON format, for
easier readability and log storage.

Final outputs:

CHAPTER FIVE

SUMMARY, CONCLUSION AND RECOMMENDATIONS


5.1 Summary
This project set out to develop an automated malware detection system by combining
static analysis, dynamic analysis, and machine learning to increase detection accuracy
and robustness. The methodology began with data collection from malware repositories,
using pefile to extract structural details from Portable Executable (PE) files, such as
headers, entry points, and sections, to support static analysis. Additionally, hashlib was
employed to generate MD5 and SHA256 hashes, enabling cross-referencing against
known malware signatures for further static verification.
For dynamic analysis, the system utilized API responses to retrieve behavior-based
insights on executable files, allowing the system to identify suspicious runtime behaviors.
This step incorporated the requests and json libraries to streamline API integration and
support real-time data retrieval.
The project’s machine learning component, a Random Forest classifier, was trained on
these extracted features, learning to distinguish between malicious and benign
executables. This model serves as an extra layer of security, capturing complex patterns
that static and dynamic methods may miss. The end result is a robust, multi-layered
malware detection system that enhances traditional methods with adaptive machine
learning for a comprehensive approach to cybersecurity.

5.2 Conclusion
The implementation of static, dynamic, and machine learning-based methods has proven
effective in developing an automated malware detection system. By combining these
techniques, the system can perform comprehensive analysis, offering high detection rates
and low false positives. This layered approach demonstrates the potential of integrating
machine learning with traditional malware detection methods, providing a stronger
defense against evolving cyber threats.

5.3 Recommendations
Future work could consider the following recommendations:
1. Enhance Dataset Scope: Expand the dataset to include a broader range of malware
types and variations for improved detection robustness.
2. Model Improvements: Experiment with additional machine learning models, such
as neural networks, to further optimize detection accuracy.
3. System Scalability: Explore methods to reduce system resource requirements,
ensuring efficient performance even in resource-constrained environments.
4. User Education: Encourage regular updates and provide user guidelines on safe file
handling to complement the system's effectiveness.

5.4 Contribution to Knowledge


This project contributes to the field of malware detection by showcasing a multi-layered
approach that integrates static and dynamic analysis with machine learning. The
integration of the pefile, hashlib, requests, and json libraries with a Random Forest model
demonstrates a practical method for detecting malware with high precision. This
approach can serve as a framework for future developments in automated malware
detection, offering insights into efficient and accurate threat detection.
Overall, this project provides a practical, adaptable framework for automated malware
detection that can be built upon by future researchers and practitioners seeking to
strengthen cybersecurity systems.

REFERENCES
Advanced Malware Analysis, CISA “Cybersecurity and Infrastructure Security Agency
(CISA) - Malware Analysis”

Alazab, M., Venkatraman, S., Watters, P., & Alazab, M. (2013). "Zero-day malware
detection based on supervised learning algorithms of API call signatures." Proceedings of
the Ninth Australasian Data Mining Conference (AusDM 2011), Ballarat, Australia.

Anderson, H. S., & Roth, P. (2018). "Ember: An open dataset for training static PE
malware machine learning models." arXiv preprint arXiv:1804.04637.

Bayer, U., Kruegel, C., & Kirda, E. (2009). "TTAnalyze: A Tool for Analyzing
Malware." Proceedings of the 15th European Conference on Research in Computer
Security (ESORICS 2009).

Bilge, L., & Dumitras, T. (2012). "Before we knew it: An empirical study of zero-day
attacks in the real world." Proceedings of the 2012 ACM conference on Computer and
communications security.

Christodorescu, M., Jha, S., Seshia, S. A., Song, D., & Bryant, R. E. (2005). "Semantics-
aware malware detection." Proceedings of the 2005 IEEE Symposium on Security and
Privacy (S&P).

Denning, D. E. (1987). "An Intrusion-Detection Model." IEEE Transactions on Software
Engineering, 13(2), 222-232.

Dinaburg, A., Royal, P., Sharif, M., & Lee, W. (2008). "Ether: malware analysis via
hardware virtualization extensions." Proceedings of the 15th ACM conference on
Computer and communications security.

Egele, M., Scholte, T., Kirda, E., & Kruegel, C. (2008). "A survey on automated dynamic
malware-analysis techniques and tools." ACM computing surveys (CSUR), 44(2), 1-42.

Faruki, P., Bharmal, A., Laxmi, V., Ganmoor, V., Gaur, M. S., Conti, M., & Rajarajan,
M. (2014). "Android security: A survey of issues, malware penetration, and defenses."
IEEE Communications Surveys & Tutorials, 17(2), 998-1022.

Khan, M. H., & Khan, I. R. (2017). Malware Detection and Analysis. International
Journal of Advanced Research in Computer Science, 8(5), 1147-1149. Retrieved from
[IJARCS](https://www.ijarcs.info).

Kleber, D., & Rios, D. (2019). Automated Malware Analysis: A Python Approach.
Python Security Handbook.

Namanya, A.P., Pagna-Disso, J.F., & Awan, I. (2015). Evaluation of Automated Static
Analysis Tools for Malware Detection in Portable Executable Files. Proceedings of the
31st UK Performance Engineering Workshop, University of Leeds, UK.

Raff, E., Barker, J., Sylvester, J., Brandon, R., Catanzaro, B., & Nicholas, C. (2020).
"Malware detection by eating a whole EXE." Proceedings of the AAAI Conference on
Artificial Intelligence, 34(01), 772-780.

Talukder, S. (2020). Tools and Techniques for Malware Detection and Analysis.
Retrieved from
[ResearchGate](https://www.researchgate.net/publication/339301928_Tools_and_Techniques_for_Malware_Detection_and_Analysis).

Vinod, P., Jaipur, R., Laxmi, V., & Gaur, M. S. (2009). "Survey on malware detection
methods." Proceedings of the 3rd Hackers Workshop on Computer and Internet Security,
74-79.

Zeltser, L. (2015). "Malware Analysis: An Introduction." SANS Institute InfoSec Reading
Room.

APPENDIX A: SOURCE CODE
import array
import hashlib
import json
import math
import os
import pickle
import time

import joblib  # needed by model_predict() to load the trained model
import pefile
import requests

# VirusTotal API key, read from the environment so the secret is not stored in the source
VIRUSTOTAL_API_KEY = os.environ.get("VIRUSTOTAL_API_KEY", "")

# Check if file is executable
def is_executable(file_path):
    return file_path.endswith('.exe') and os.path.isfile(file_path)

# Static Analysis Functions
def load_executable(file_path):
    try:
        pe = pefile.PE(file_path)
        return pe
    except pefile.PEFormatError:
        print("Error: Not a valid PE file.")
        return None

def calculate_hash(file_path):
    with open(file_path, 'rb') as f:
        file_data = f.read()
    md5_hash = hashlib.md5(file_data).hexdigest()
    sha256_hash = hashlib.sha256(file_data).hexdigest()
    return md5_hash, sha256_hash

def print_basic_info(pe):
    info = {
        "Entry Point": hex(pe.OPTIONAL_HEADER.AddressOfEntryPoint),
        "Image Base": hex(pe.OPTIONAL_HEADER.ImageBase),
        "Number of Sections": pe.FILE_HEADER.NumberOfSections
    }
    return info

def print_sections(pe):
    return [
        {
            # Section names are NUL-padded to 8 bytes, so strip the padding
            "Name": section.Name.decode(errors="replace").rstrip("\x00"),
            "Virtual Address": hex(section.VirtualAddress),
            "Size": section.SizeOfRawData,
        }
        for section in pe.sections
    ]

def print_imports(pe):
    imports_info = []
    if hasattr(pe, 'DIRECTORY_ENTRY_IMPORT'):
        for entry in pe.DIRECTORY_ENTRY_IMPORT:
            dll_info = {"DLL": entry.dll.decode(), "Functions": []}
            for func in entry.imports:
                dll_info["Functions"].append({
                    # Imports by ordinal have no name
                    "Function": func.name.decode() if func.name else "N/A",
                    "Address": hex(func.address)
                })
            imports_info.append(dll_info)
    return imports_info

# Dynamic Analysis Functions
def submit_to_virustotal(file_path):
    url = "https://www.virustotal.com/vtapi/v2/file/scan"
    params = {"apikey": VIRUSTOTAL_API_KEY}
    with open(file_path, 'rb') as f:
        files = {"file": f}
        response = requests.post(url, files=files, params=params)
    if response.status_code == 200:
        resource = response.json().get("resource")
        print("[+] File submitted to VirusTotal. Resource ID:", resource)
        return resource
    print("[-] Failed to submit file for dynamic analysis.")
    return None

def get_virustotal_report(resource):
    url = "https://www.virustotal.com/vtapi/v2/file/report"
    params = {"apikey": VIRUSTOTAL_API_KEY, "resource": resource}
    time.sleep(30)  # Adjust time as needed
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return response.json()
    print("[-] Failed to retrieve report.")
    return None

def get_entropy(data):
    # Shannon entropy of the byte distribution, in bits per byte
    if len(data) == 0:
        return 0.0
    occurrences = array.array('L', [0] * 256)
    for x in data:
        occurrences[x if isinstance(x, int) else ord(x)] += 1
    entropy = 0
    for x in occurrences:
        if x:
            p_x = float(x) / len(data)
            entropy -= p_x * math.log(p_x, 2)
    return entropy

def get_resources(pe):
    # Collect [entropy, size] for every resource entry in the PE file
    resources = []
    if hasattr(pe, 'DIRECTORY_ENTRY_RESOURCE'):
        try:
            for resource_type in pe.DIRECTORY_ENTRY_RESOURCE.entries:
                if hasattr(resource_type, 'directory'):
                    for resource_id in resource_type.directory.entries:
                        if hasattr(resource_id, 'directory'):
                            for resource_lang in resource_id.directory.entries:
                                data = pe.get_data(resource_lang.data.struct.OffsetToData,
                                                   resource_lang.data.struct.Size)
                                size = resource_lang.data.struct.Size
                                entropy = get_entropy(data)
                                resources.append([entropy, size])
        except Exception:
            return resources
    return resources

def get_version_info(pe):
    """Return version information."""
    res = {}
    for fileinfo in pe.FileInfo:
        if fileinfo.Key == 'StringFileInfo':
            for st in fileinfo.StringTable:
                for entry in st.entries.items():
                    res[entry[0]] = entry[1]
        if fileinfo.Key == 'VarFileInfo':
            for var in fileinfo.Var:
                # dict.items() is a view in Python 3, so materialize it before indexing
                key, value = list(var.entry.items())[0]
                res[key] = value
    if hasattr(pe, 'VS_FIXEDFILEINFO'):
        res['flags'] = pe.VS_FIXEDFILEINFO.FileFlags
        res['os'] = pe.VS_FIXEDFILEINFO.FileOS
        res['type'] = pe.VS_FIXEDFILEINFO.FileType
        res['file_version'] = pe.VS_FIXEDFILEINFO.FileVersionLS
        res['product_version'] = pe.VS_FIXEDFILEINFO.ProductVersionLS
        res['signature'] = pe.VS_FIXEDFILEINFO.Signature
        res['struct_version'] = pe.VS_FIXEDFILEINFO.StrucVersion
    return res

def extract_info(fpath):
    res = {}
    try:
        pe = pefile.PE(fpath)
    except pefile.PEFormatError:
        return {}
    # File header and optional header fields
    res['Machine'] = pe.FILE_HEADER.Machine
    res['SizeOfOptionalHeader'] = pe.FILE_HEADER.SizeOfOptionalHeader
    res['Characteristics'] = pe.FILE_HEADER.Characteristics
    res['MajorLinkerVersion'] = pe.OPTIONAL_HEADER.MajorLinkerVersion
    res['MinorLinkerVersion'] = pe.OPTIONAL_HEADER.MinorLinkerVersion
    res['SizeOfCode'] = pe.OPTIONAL_HEADER.SizeOfCode
    res['SizeOfInitializedData'] = pe.OPTIONAL_HEADER.SizeOfInitializedData
    res['SizeOfUninitializedData'] = pe.OPTIONAL_HEADER.SizeOfUninitializedData
    res['AddressOfEntryPoint'] = pe.OPTIONAL_HEADER.AddressOfEntryPoint
    res['BaseOfCode'] = pe.OPTIONAL_HEADER.BaseOfCode
    try:
        res['BaseOfData'] = pe.OPTIONAL_HEADER.BaseOfData
    except AttributeError:
        res['BaseOfData'] = 0
    res['ImageBase'] = pe.OPTIONAL_HEADER.ImageBase
    res['SectionAlignment'] = pe.OPTIONAL_HEADER.SectionAlignment
    res['FileAlignment'] = pe.OPTIONAL_HEADER.FileAlignment
    res['MajorOperatingSystemVersion'] = pe.OPTIONAL_HEADER.MajorOperatingSystemVersion
    res['MinorOperatingSystemVersion'] = pe.OPTIONAL_HEADER.MinorOperatingSystemVersion
    res['MajorImageVersion'] = pe.OPTIONAL_HEADER.MajorImageVersion
    res['MinorImageVersion'] = pe.OPTIONAL_HEADER.MinorImageVersion
    res['MajorSubsystemVersion'] = pe.OPTIONAL_HEADER.MajorSubsystemVersion
    res['MinorSubsystemVersion'] = pe.OPTIONAL_HEADER.MinorSubsystemVersion
    res['SizeOfImage'] = pe.OPTIONAL_HEADER.SizeOfImage
    res['SizeOfHeaders'] = pe.OPTIONAL_HEADER.SizeOfHeaders
    res['CheckSum'] = pe.OPTIONAL_HEADER.CheckSum
    res['Subsystem'] = pe.OPTIONAL_HEADER.Subsystem
    res['DllCharacteristics'] = pe.OPTIONAL_HEADER.DllCharacteristics
    res['SizeOfStackReserve'] = pe.OPTIONAL_HEADER.SizeOfStackReserve
    res['SizeOfStackCommit'] = pe.OPTIONAL_HEADER.SizeOfStackCommit
    res['SizeOfHeapReserve'] = pe.OPTIONAL_HEADER.SizeOfHeapReserve
    res['SizeOfHeapCommit'] = pe.OPTIONAL_HEADER.SizeOfHeapCommit
    res['LoaderFlags'] = pe.OPTIONAL_HEADER.LoaderFlags
    res['NumberOfRvaAndSizes'] = pe.OPTIONAL_HEADER.NumberOfRvaAndSizes

    # Sections
    res['SectionsNb'] = len(pe.sections)
    entropy = list(map(lambda x: x.get_entropy(), pe.sections))
    res['SectionsMeanEntropy'] = sum(entropy) / float(len(entropy))
    res['SectionsMinEntropy'] = min(entropy)
    res['SectionsMaxEntropy'] = max(entropy)
    raw_sizes = list(map(lambda x: x.SizeOfRawData, pe.sections))
    res['SectionsMeanRawsize'] = sum(raw_sizes) / float(len(raw_sizes))
    res['SectionsMinRawsize'] = min(raw_sizes)
    res['SectionsMaxRawsize'] = max(raw_sizes)
    virtual_sizes = list(map(lambda x: x.Misc_VirtualSize, pe.sections))
    res['SectionsMeanVirtualsize'] = sum(virtual_sizes) / float(len(virtual_sizes))
    res['SectionsMinVirtualsize'] = min(virtual_sizes)
    res['SectionMaxVirtualsize'] = max(virtual_sizes)

    # Imports
    try:
        res['ImportsNbDLL'] = len(pe.DIRECTORY_ENTRY_IMPORT)
        imports = sum([x.imports for x in pe.DIRECTORY_ENTRY_IMPORT], [])
        res['ImportsNb'] = len(imports)
        res['ImportsNbOrdinal'] = len(list(filter(lambda x: x.name is None, imports)))
    except AttributeError:
        res['ImportsNbDLL'] = 0
        res['ImportsNb'] = 0
        res['ImportsNbOrdinal'] = 0
    # Exports
    try:
        res['ExportNb'] = len(pe.DIRECTORY_ENTRY_EXPORT.symbols)
    except AttributeError:
        # No export
        res['ExportNb'] = 0

    # Resources
    resources = get_resources(pe)
    res['ResourcesNb'] = len(resources)
    if len(resources) > 0:
        entropy = list(map(lambda x: x[0], resources))
        res['ResourcesMeanEntropy'] = sum(entropy) / float(len(entropy))
        res['ResourcesMinEntropy'] = min(entropy)
        res['ResourcesMaxEntropy'] = max(entropy)
        sizes = list(map(lambda x: x[1], resources))
        res['ResourcesMeanSize'] = sum(sizes) / float(len(sizes))
        res['ResourcesMinSize'] = min(sizes)
        res['ResourcesMaxSize'] = max(sizes)
    else:
        res['ResourcesNb'] = 0
        res['ResourcesMeanEntropy'] = 0
        res['ResourcesMinEntropy'] = 0
        res['ResourcesMaxEntropy'] = 0
        res['ResourcesMeanSize'] = 0
        res['ResourcesMinSize'] = 0
        res['ResourcesMaxSize'] = 0

    # Load configuration size
    try:
        res['LoadConfigurationSize'] = pe.DIRECTORY_ENTRY_LOAD_CONFIG.struct.Size
    except AttributeError:
        res['LoadConfigurationSize'] = 0
    # Version configuration size
    try:
        version_info = get_version_info(pe)
        res['VersionInformationSize'] = len(version_info.keys())
    except AttributeError:
        res['VersionInformationSize'] = 0
    return res

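def model_predict(file_path):
    model = joblib.load("model/model.pkl")
    # Load the ordered list of feature names selected during training
    with open(os.path.join('model', 'features.pkl'), 'rb') as f:
        features = pickle.load(f)
    data = extract_info(file_path)
    if data:
        pe_features = list(map(lambda x: data[x], features))
        result = model.predict([pe_features])[0]
        # The trainer uses the dataset's 'legitimate' column as the label,
        # so a prediction of 0 means the file is classified as malicious.
        return "Malware" if result == 0 else "No Malware Detected!"
    return "Analysis Incomplete"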

# Complete Analysis
def analyze_file(file_path):
    if is_executable(file_path):
        print("\n[+] Performing Static Analysis...")
        pe = load_executable(file_path)
        if pe:
            # Hash the file once and reuse both digests
            md5_hash, sha256_hash = calculate_hash(file_path)
            static_report = {
                "File Path": file_path,
                "MD5 Hash": md5_hash,
                "SHA256 Hash": sha256_hash,
                "Basic Info": print_basic_info(pe),
                "Sections": print_sections(pe),
                "Imports": print_imports(pe)
            }
            print(json.dumps(static_report, indent=4))

            print("\n[+] Performing Dynamic Analysis with VirusTotal...")
            resource = submit_to_virustotal(file_path)
            dynamic_report = get_virustotal_report(resource) if resource else {}
            if dynamic_report:
                positives = dynamic_report.get("positives", 0)
                dynamic_result = "Malware" if positives > 0 else "No Malware Detected"
                print(json.dumps(dynamic_report, indent=4))
            else:
                dynamic_result = "Analysis Incomplete"

            print("\n[+] Running Model-based Analysis...")
            model_result = model_predict(file_path)

            # Final Summary
            print("\n========== Analysis Summary ==========")
            print("Static Analysis:",
                  "Executable loaded successfully" if pe else "Failed to load PE file")
            print("Dynamic Analysis:", dynamic_result)
            print("Model-based Analysis:", model_result)
            if dynamic_result == "Malware" or model_result == "Malware":
                print("Final Result: Malware Detected")
            else:
                print("Final Result: No Malware Detected")
        else:
            print("[-] Static Analysis failed: Not a valid PE file.")
    else:
        print("[-] Invalid executable file format.")

def main():
    file_path = input("Enter the path to the executable file: ")
    analyze_file(file_path)

if __name__ == "__main__":
    main()
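
The final decision rule in analyze_file flags a file as malware when either the VirusTotal lookup or the model reports "Malware". The sketch below restates that rule as a pure function so it can be tested in isolation. The name combine_verdicts is hypothetical and is not part of the original script.

```python
def combine_verdicts(dynamic_result: str, model_result: str) -> str:
    """Return the final verdict using the same OR rule as analyze_file."""
    # Either stage reporting "Malware" is enough to flag the file.
    if dynamic_result == "Malware" or model_result == "Malware":
        return "Malware Detected"
    # Anything else, including "Analysis Incomplete", falls through to clean.
    return "No Malware Detected"


if __name__ == "__main__":
    print(combine_verdicts("Analysis Incomplete", "Malware"))
    print(combine_verdicts("No Malware Detected", "No Malware Detected!"))
```

Under this rule an "Analysis Incomplete" result is treated the same as a clean result, so a file for which both stages fail is still reported as "No Malware Detected".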

Model Trainer code:

import pickle

import joblib
import numpy
import pandas
import sklearn.ensemble as ek
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# for dataset_1 -> sep=',', for dataset_2 -> sep='|'
dataset = pandas.read_csv('./datasets/dataset_1.csv', sep=',', low_memory=False)
# dataset.head()
# dataset.describe()
# dataset.groupby(dataset['legitimate']).size()

# data preprocessing
feature_frame = dataset.drop(['ID', 'md5', 'legitimate'], axis=1)
X = feature_frame.values
y = dataset['legitimate'].values

# Select the features the tree ensemble considers important
extratrees = ek.ExtraTreesClassifier().fit(X, y)
selector = SelectFromModel(extratrees, prefit=True)
X_new = selector.transform(X)
nbfeatures = X_new.shape[1]
# print(nbfeatures)

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2)

# Record the selected column names in the same order as the columns of X_new,
# so that model_predict builds its feature vector in the order the model expects
features = list(feature_frame.columns[selector.get_support()])

# Optional: print the selected features ranked by importance
index = numpy.argsort(extratrees.feature_importances_)[::-1][:nbfeatures]
# for f in range(nbfeatures):
#     print("%d. feature %s (%f)" % (f + 1, feature_frame.columns[index[f]], extratrees.feature_importances_[index[f]]))

model = ek.RandomForestClassifier(n_estimators=33)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", (score * 100), '%')

joblib.dump(model, "model/model.pkl")
with open('model/features.pkl', 'wb') as f:
    pickle.dump(features, f)

# False Positives and Negatives
# Note: X_new includes the training rows, so these rates are measured on data the model has partly seen
res = model.predict(X_new)
mt = confusion_matrix(y, res)
print("False positive rate : %f %%" % ((mt[0][1] / float(sum(mt[0]))) * 100))
print('False negative rate : %f %%' % (mt[1][0] / float(sum(mt[1])) * 100))
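
The trainer computes its two error rates on X_new, which contains the rows the model was fitted on, so the reported rates are likely optimistic. The sketch below expresses the same two ratios as a dependency-free helper that could instead be applied to held-out labels such as y_test and model.predict(X_test). The name error_rates is hypothetical, and the sample labels are made up for illustration only.

```python
def error_rates(y_true, y_pred):
    """Return (percent of class-0 rows predicted as 1,
               percent of class-1 rows predicted as 0).

    Mirrors mt[0][1] / sum(mt[0]) and mt[1][0] / sum(mt[1]) from the trainer.
    """
    zero_total = sum(1 for t in y_true if t == 0)
    one_total = sum(1 for t in y_true if t == 1)
    zero_as_one = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    one_as_zero = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against an empty class so the helper never divides by zero.
    rate_zero = 100.0 * zero_as_one / zero_total if zero_total else 0.0
    rate_one = 100.0 * one_as_zero / one_total if one_total else 0.0
    return rate_zero, rate_one


if __name__ == "__main__":
    # Made-up held-out labels and predictions, for illustration only.
    y_test = [0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
    print(error_rates(y_test, y_pred))
```

Because the target column is named legitimate, the first ratio presumably counts malware rows predicted as legitimate and the second counts legitimate rows predicted as malware; this reading assumes legitimate files are encoded as 1.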

APPENDIX B: SCREENSHOT OF OUTPUT

Normal exe

Malware Detected
