Automated Malware Analysis Update
BY
U2019/5570122
Page | 1
DECLARATION
I hereby declare that this project titled “Automated Malware Detection System” is the
result of my own research and work, carried out as part of my academic requirements
under the supervision of Dr Ugochi Okengwu. This project was developed with integrity
and adherence to academic and professional standards.
All sources of information, references, and data used have been duly acknowledged. This
work has not been submitted in any previous application for a degree or qualification, nor
has it been presented elsewhere for academic credit.
I affirm that this project represents my personal contributions to the field of
cybersecurity, and I am responsible for the work presented within.
……..……………………… ………………………………
DR. UGOCHI OKENGWU DATE
(PROJECT SUPERVISOR)
Contents
ACKNOWLEDGEMENT
ABSTRACT
DEDICATION
CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
1.2 Statement of Problem
1.3 Significance of the Study
1.4 Aim and Objectives
1.5 Definition of Terms
1.6 Limitations of Study
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
2.2 The Evolution of Malware and the Need for Automated Analysis
2.3 Techniques for Automated Malware Analysis
2.4 Challenges and Future Directions
2.5 Overview of Malware Detection Techniques
2.7 Review of Related Literature
CHAPTER THREE
ANALYSIS AND DESIGN
3.1 Analysis of Existing System
3.2 Advantages of Existing System
3.3 Analysis of Proposed System
3.4 Advantages of Proposed System
3.5 Design of Proposed System
3.5.1 Methodology of Proposed System
3.5.2 Flowchart of Proposed System
3.5.3 Use Case Diagram of Proposed System
3.5.4 Database Structure of the Proposed System
3.5.5 Architecture of Proposed System
CHAPTER FOUR
IMPLEMENTATION
4.1 Hardware Requirement
4.2 Software Requirement
4.3 Output
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.1 Summary
5.2 Conclusion
5.3 Recommendations
5.4 Contribution to Knowledge
REFERENCES
Khan, M. H., & Khan, I. R. (2017). Malware Detection and Analysis. International
Journal of Advanced Research in Computer Science, 8(5), 1147-1149. Retrieved from
IJARCS (https://www.ijarcs.info)
Kleber, D., & Rios, D. (2019). Automated Malware Analysis: A Python Approach.
Python Security Handbook.
Talukder, S. (2020). Tools and Techniques for Malware Detection and Analysis.
Retrieved from ResearchGate
(https://www.researchgate.net/publication/339301928_Tools_and_Techniques_for_Malware_Detection_and_Analysis)
APPENDIX A: SOURCE CODE
APPENDIX B: SCREENSHOT OF OUTPUT
CERTIFICATION
This is to certify that the project titled “Automated Malware Detection System” was
carried out by Krama Gideon Ozuzhim, with matriculation number U2019/5570122, in partial
fulfillment of the requirements for the award of Bachelor of Science in Computer Science
at the University of Port Harcourt.
This work is original and has been conducted under the supervision of Dr. Ugochi
Okengwu. It adheres to the academic standards and guidelines of the institution.
ACKNOWLEDGEMENT
I begin by giving thanks to God Almighty, whose guidance, wisdom, and unwavering
support have seen me through every challenge of this project. His grace has been my
source of strength, and I am deeply grateful for His presence in my life.
I would also like to express my heartfelt gratitude to my family for their constant love,
encouragement, and belief in me.
My sincere thanks go to my Head of Department, Dr. Mrs. Ugochi Okengwu, whose
expert guidance and support throughout the duration of this project have been invaluable.
Your dedication to excellence and commitment to nurturing students like me have made a
lasting impact on my life.
To everyone who has played a role in this project, whether directly or indirectly, I offer
my deepest appreciation. Your contributions have made this achievement possible.
ABSTRACT
This project presents an automated malware detection system designed to improve the
accuracy and speed of identifying malicious files. The system integrates three distinct
methods (static analysis, dynamic analysis, and machine learning) to offer a robust,
multi-layered approach to malware detection. Static analysis is used to examine
executable files for specific attributes such as headers and hashes, allowing quick
identification of known threats. Dynamic analysis complements this by monitoring the
runtime behavior of executables to detect suspicious activities in real time.
To further enhance detection capabilities, a Random Forest machine learning model was
trained on a dataset of benign and malicious samples. This model helps the system
rapidly identify previously unseen malware by learning patterns associated with
malicious behaviors. The combination of these techniques provides a more
comprehensive detection system that balances speed and accuracy, making it adaptable to
evolving cyber threats. This project demonstrates the effectiveness of a multi-faceted
approach in enhancing cybersecurity defenses and aims to contribute to the advancement
of automated malware detection technology.
DEDICATION
I dedicate this project to God, for His infinite wisdom, guidance, and strength, which
have carried me through every challenge and success along this journey. I also dedicate
this work to the next generation of cybersecurity specialists, whose passion and
commitment will continue to shape the future of digital security. May this project inspire
them to innovate, protect, and safeguard the digital world, advancing the fight against
cyber threats for a safer tomorrow.
CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
The exponential growth in internet usage and software applications has simultaneously
increased the risk of cyber threats. Among these threats, malware poses a substantial risk,
as it is designed to damage, disrupt, or gain unauthorized access to computer systems.
Traditional antivirus programs detect malware by using predefined signature databases,
but the continuous emergence of new and obfuscated malware variants has rendered these
methods less effective. This study introduces an automated malware detection system that
integrates static and dynamic analysis with machine learning to improve the accuracy and
efficiency of malware identification. Traditional malware detection methods often rely on
manual inspection, where cybersecurity professionals analyze files for signs of malicious
activity. While effective in some cases, manual analysis is time-consuming and prone to
human error. It also struggles to keep pace with the rapidly evolving techniques used by
modern malware to avoid detection (Namanya et al., 2015). This has created an urgent
need for faster, more reliable ways to detect and respond to malware.
To address these challenges, automated malware analysis systems have been developed.
These systems use advanced algorithms and machine learning techniques to analyze
malicious files automatically, without the need for human intervention (Kleber & Rios,
2019). By automating the detection process, these tools can rapidly identify and respond
to malware, reducing the time it takes to mitigate threats and minimizing the risk of
damage to systems (Cuckoo Sandbox Project, 2021). Automated systems can also
analyze large volumes of data, improving overall security by catching threats that may be
missed during manual inspections.
contexts, like Nigeria, where unique cybersecurity challenges exist. As more threats
emerge, the continuous development of these automated tools remains a crucial aspect of
defending against evolving cyberattacks.
1.3 Significance of the Study
1. Improved Malware Detection Accuracy
Traditional malware detection methods, such as signature-based techniques, are
becoming increasingly ineffective due to the constant evolution of malware. By
combining static analysis, dynamic analysis, and machine learning, this study
provides a more accurate and comprehensive approach. Static analysis extracts file
features without execution, dynamic analysis observes behavior during runtime, and
machine learning enables the system to learn from historical data, identifying new and
previously unseen malware variants. This multi-layered approach significantly
improves detection accuracy compared to traditional methods.
2. Adaptability to Evolving Malware
One of the major challenges in malware detection is the continuous development of
new strains. Traditional methods struggle to detect modified or obfuscated malware.
The proposed system adapts to emerging threats by analyzing files from multiple
perspectives—structural (static) and behavioral (dynamic)—while the machine
learning component learns from evolving data. This adaptability allows the system to
maintain high detection rates even as malware changes.
3. Efficiency in Malware Identification
By automating the malware detection process, this study offers a more efficient
alternative to manual inspection. The use of automated static and dynamic analysis
significantly reduces the time required to identify malicious files. The machine
learning model, once trained, can classify new files in a fraction of the time it would
take for manual methods or traditional signature-based systems to process large
volumes of data. This efficiency makes the system a valuable tool for real-time
malware detection in both individual and enterprise-level security operations.
4. Contribution to Cybersecurity Research
This study contributes to the ongoing research in the field of cybersecurity by
proposing an innovative hybrid approach to malware detection. By integrating
multiple detection techniques, it pushes the boundaries of traditional detection
methods, highlighting the potential of combining behavioral analysis and machine
learning for more robust security systems. The findings of this study may inspire
future research and development of next-generation malware detection tools.
1.4 Aim and Objectives
detection to modern approaches involving behavior-based and machine learning methods.
The goal of malware detection is to minimize the impact of malicious software by
identifying it early and accurately.
3. Static Analysis
Static analysis is the examination of files or software without executing them. In malware
detection, static analysis inspects a file's structure—such as its code, metadata, imports,
and sections—searching for indicators of malicious behavior. Key techniques include
hash calculation, header analysis, and entropy calculation. Static analysis is fast and safe
since it doesn’t involve running potentially dangerous code. However, it has limitations,
particularly against obfuscation techniques that can disguise malware within files.
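As a minimal sketch of two of the techniques named above, hash calculation and entropy calculation, the following Python uses only the standard library. The function names and the 7.0 bits-per-byte threshold are illustrative assumptions and are not taken from this project's source code.

```python
import hashlib
import math
from collections import Counter

# Hypothetical threshold: entropy close to 8.0 bits per byte often
# accompanies packed or encrypted content, but it is not proof of malware.
HIGH_ENTROPY_THRESHOLD = 7.0


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the file contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_packed(data: bytes) -> bool:
    """Flag data whose entropy exceeds the illustrative threshold."""
    return shannon_entropy(data) > HIGH_ENTROPY_THRESHOLD
```

The hash supports quick lookups against lists of known samples, while the entropy value gives a rough indication of packing or encryption that a hash alone cannot provide.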
4. Dynamic Analysis
Unlike static analysis, dynamic analysis involves executing the file in a controlled
environment (e.g., sandbox or virtual machine) to observe its behavior. This method is
useful for detecting behaviors like unauthorized file changes, network connections, or
interactions with system processes that may indicate malware. Services like VirusTotal
support dynamic analysis by running submitted files in their own sandboxes and reporting
the observed behavior. Although dynamic analysis is highly effective for behavior-based
detection, it requires significant resources and time, making it less practical for large-
scale real-time detection.
5. Machine Learning
Machine learning (ML) is a branch of artificial intelligence that enables systems to learn
from data, improve over time, and make predictions or decisions without being explicitly
programmed. In malware detection, machine learning can analyze patterns in data to
classify files as benign or malicious. Machine learning models are trained on large
datasets, learning features that distinguish malware from legitimate files. By recognizing
these patterns, ML models can detect new malware variants, making this approach more
robust against evolving threats compared to traditional methods.
6. Feature Extraction
Feature extraction is the process of identifying and isolating specific attributes of a file
that are useful for distinguishing between malware and benign software. Features might
include file size, entropy, PE header information, and imported functions. In a machine
learning-based malware detection system, feature extraction is critical because it provides
the model with data points that it uses to detect malware patterns. Effective feature
selection and extraction improve model accuracy and efficiency, helping it to generalize
across different types of malware.
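To illustrate feature extraction from PE header information, the sketch below reads a few fields of the COFF file header directly from the raw bytes with the standard `struct` module. The field offsets follow the published PE/COFF layout; the function name and the dictionary keys are hypothetical, and a full implementation would normally rely on a dedicated PE parsing library.

```python
import struct


def extract_pe_features(data: bytes) -> dict:
    """Extract a few header fields from the raw bytes of a PE file.

    Raises ValueError if the data does not look like a PE file.
    """
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS) header")
    # e_lfanew, stored at offset 0x3C, is the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    if len(data) < pe_offset + 24:
        raise ValueError("truncated COFF file header")
    # The 20-byte COFF file header follows the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     optional_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "file_size": len(data),
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": optional_size,
        "characteristics": characteristics,
    }
```

Each returned value can serve as one column of the feature vector passed to the machine learning model.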
7. False Positive and False Negative
False positives and false negatives are important metrics in malware detection that reflect
the accuracy and reliability of the system:
False Positive: Occurs when a benign file is incorrectly classified as malware. High
false-positive rates can reduce trust in the detection system and lead to unnecessary
resource usage.
False Negative: Occurs when a malicious file is misclassified as benign, allowing it to
evade detection. False negatives are particularly dangerous because they allow malware
to bypass security measures and cause harm to the system.
Reducing both false positives and false negatives is crucial.
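The two error types can be quantified as rates. The sketch below computes the false positive rate, FP / (FP + TN), and the false negative rate, FN / (FN + TP), from lists of true and predicted labels, assuming the convention that 1 means malicious and 0 means benign. The function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN where 1 means malicious and 0 means benign."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 0 and predicted == 1:
            fp += 1
        elif truth == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def error_rates(y_true, y_pred):
    """Return the false positive rate and false negative rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"false_positive_rate": fpr, "false_negative_rate": fnr}
```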
1.6 Limitations of Study
Malware is constantly evolving, with new and more sophisticated versions being created
regularly. These new variants might have different characteristics or use advanced
techniques to avoid detection. As a result, the system developed for detecting malware
may struggle to identify these novel threats effectively. In other words, if the malware is
too new or too cleverly designed, it might slip through the system's defenses.
Automated malware detection systems are not perfect and can sometimes make errors.
Two main types of error were observed:
False Positives: This happens when the system incorrectly identifies a legitimate,
harmless file as malicious. This can lead to unnecessary alerts or actions being taken
against non-threatening files.
False Negatives: This occurs when the system fails to detect an actual malicious file,
allowing it to go unnoticed. This means a real threat could bypass the system
undetected, potentially causing harm.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
The increasing sophistication and volume of malware attacks pose significant challenges
to cybersecurity. Traditional methods of malware detection, relying heavily on signature-
based approaches and manual analysis, are no longer sufficient to address the evolving
landscape of cyber threats. Consequently, automated malware analysis, particularly
through the development of Python programs, has gained traction as a method to enhance
threat detection and mitigation efforts. This literature review explores the existing
research on automated malware analysis, focusing on Python-based methods for
identifying malicious executables.
2.2 The Evolution of Malware and the Need for Automated Analysis
Malware, short for malicious software, encompasses a variety of harmful software types,
including viruses, worms, trojans, and ransomware. The rapid evolution of malware has
outpaced traditional detection methods. According to Vinod (2009), signature-based
detection, while effective against known threats, struggles with detecting new and
polymorphic malware variants. The need for automated, behavior-based detection
methods has thus become paramount.
2.3 Techniques for Automated Malware Analysis
Automated malware analysis can be broadly categorized into static and dynamic analysis
techniques. Static analysis involves examining the executable without running it,
focusing on aspects such as file structure, strings, and binary patterns. Dinaburg et al. (2008)
highlighted the effectiveness of static analysis in identifying malware characteristics by
analyzing code signatures, while Faruki et al. (2014) discussed the limitations of static
analysis in detecting obfuscated or packed malware.
Dynamic analysis, on the other hand, involves executing the malware in a controlled
environment (sandbox) to observe its behavior. Research by Egele et al. (2008) demonstrated
that dynamic analysis can effectively identify malicious behavior patterns, such as
network activity and system modifications, which are often missed by static methods.
Python, with tools like Cuckoo Sandbox, has been instrumental in advancing dynamic
analysis techniques by automating the observation and recording of malware behavior in
real-time.
Recent studies have focused on incorporating machine learning (ML) techniques into
automated malware analysis. Anderson and Roth (2018) explored the application of
machine learning models, such as decision trees and neural networks, to classify
executables as malicious or benign based on features extracted from both static and
dynamic analysis. Their study demonstrated that ML models could significantly enhance
detection accuracy, particularly when trained on large datasets of labeled malware
samples.
Raff et al. (2020) further advanced this field by introducing deep learning techniques for
malware detection. Their research showed that deep neural networks could automatically
learn complex features from raw binary data, reducing the need for manual feature
engineering. Python's ML libraries, including TensorFlow and PyTorch, have facilitated
these advancements by providing robust frameworks for implementing and
experimenting with deep learning models.
Moreover, the reliance on labeled datasets for training machine learning models presents
a challenge, as labeling large volumes of malware samples is time-consuming and
requires expert knowledge. Future research should focus on developing unsupervised or
semi-supervised learning techniques that can detect novel malware without extensive
labeled datasets.
such as Scikit-learn and TensorFlow, have been instrumental in developing
models that learn from benign and malicious behavior to identify anomalies.
3. Hybrid Analysis: Hybrid analysis combines static and dynamic methods to
provide a more comprehensive evaluation of a potential threat. Research by
Anderson et al. (2011) demonstrated that combining the strengths of both approaches
leads to higher detection rates and better accuracy in classifying malware.
The integration of machine learning into malware detection has revolutionized the field,
enabling more sophisticated analysis and classification of threats.
Python’s versatility allows researchers to implement a range of unsupervised
algorithms, such as k-means clustering, for these purposes.
Future Directions
The future of automated malware analysis lies in addressing the challenges identified and
enhancing current methodologies. Research should focus on:
1. Increasing Evasion Resistance: The current system performs static and dynamic
analysis to identify malicious files, but sophisticated malware can evade detection by
using advanced obfuscation or encryption methods. Future work should focus on
enhancing the static analysis component to detect obfuscation more reliably, such as
by improving entropy-based analysis and incorporating additional PE header
attributes. Additionally, expanding the machine learning model to include features
derived from both static and dynamic analyses could help the system learn more
complex patterns of evasion.
2. Enhancing Model Interpretability and Reducing Data Dependency: The current
machine learning model is trained on labeled datasets, but acquiring and labeling large
datasets can be challenging. Future research could explore feature reduction or
advanced transfer learning to generalize from smaller datasets. Additionally, model
interpretability tools could be integrated to understand which features most influence
predictions, aiding in transparency and debugging, especially when dealing with high-
stakes malware detection.
3. Refining Hybrid Detection Approaches: The combination of static, dynamic, and
machine learning-based approaches in this system is an effective hybrid model. Future
enhancements could involve implementing a more seamless integration between static
and dynamic features, allowing the model to consider behavioral indicators alongside
structural attributes. This hybrid approach would make the system more robust,
particularly in identifying malware that exhibits suspicious behavior patterns post-
execution.
Functionality: The paper by Khan and Khan (2017) provides an in-depth exploration of
various techniques for malware detection and analysis, focusing on both static and
dynamic analysis methods. Static analysis involves inspecting the code without executing
it, while dynamic analysis involves running the malware in a controlled environment to
observe its behavior. The study introduces a framework designed to simplify malware
detection by allowing users to analyze binary files using Python scripts. The framework
enables users to scan through complex code to extract valuable information about the
malware’s structure and behavior.
Problem Solved: This paper addresses the growing need for effective tools to detect and
analyze increasingly sophisticated malware. By leveraging both static and dynamic
analysis techniques, the framework proposed in the paper helps users understand the
potential impact of malware on a system and develop appropriate signatures for detecting
malware infections on networks. The use of Python in the framework allows for
automation and simplification of the malware analysis process, making it more accessible
to users with varying levels of technical expertise.
Identified Gap: While the framework presented in the paper is comprehensive, it does
not specifically address the challenges faced in regions with limited cybersecurity
infrastructure, such as Nigeria. Additionally, the framework's reliance on general
malware signatures may not be fully effective in detecting region-specific threats that are
prevalent in countries like Nigeria. There is also a lack of emphasis on integrating
localized datasets into the analysis process, which is crucial for improving detection
accuracy in specific regions.
Proposed Solution:
My project can address these gaps by developing an automated malware detection tool
tailored for the Nigerian context. This tool would combine both static and dynamic
analysis methods, optimized for low-resource environments. By integrating a localized
dataset of Nigerian malware samples, the tool can improve detection accuracy and
relevance, making it more effective in addressing the unique cybersecurity challenges
faced in Nigeria.
CHAPTER THREE
Current Systems:
Limitations:
1. Slow Response to New Threats: Signature updates are required to detect new
malware, leading to a window of vulnerability.
3.2 Advantages of Existing System
1. Wide Adoption: Antivirus software is widely used, making it a first line of
defense for many users.
3.3 Analysis of Proposed System
The proposed system aims to address the limitations of traditional malware detection
methods by incorporating automated analysis techniques that focus on behavioral patterns
and machine learning algorithms.
Key Features:
1. Static Analysis: Examines file attributes (such as PE header information and entropy)
without executing the file. This allows for quick analysis and identification of
suspicious attributes.
2. Dynamic Analysis (VirusTotal Integration): Submits files to VirusTotal for external
behavior-based analysis, adding a layer of behavioral insight that static analysis alone
cannot provide.
3. Machine Learning: A model (using Random Forest in this case) is trained on
extracted features to distinguish between malicious and benign files. This enables the
system to generalize patterns and detect previously unknown malware.
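The VirusTotal step in the key features above could be approached as in the following sketch. It assumes the public VirusTotal API v3, where a report for an already known file is fetched by hash from the `/api/v3/files/` endpoint with the key passed in the `x-apikey` header, and where engine verdict counts appear under `last_analysis_stats`. It looks up an existing report rather than uploading a file, the function names are hypothetical, and the details should be checked against the official API documentation before use.

```python
import json
import urllib.request

VT_FILE_URL = "https://www.virustotal.com/api/v3/files/{}"


def build_vt_request(sha256: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the existing VirusTotal report of a file hash."""
    return urllib.request.Request(
        VT_FILE_URL.format(sha256),
        headers={"x-apikey": api_key},
    )


def summarize_vt_report(report: dict) -> dict:
    """Reduce a VirusTotal file report to the engine verdict counts."""
    stats = (report.get("data", {})
                   .get("attributes", {})
                   .get("last_analysis_stats", {}))
    return {
        "malicious": stats.get("malicious", 0),
        "suspicious": stats.get("suspicious", 0),
        "harmless": stats.get("harmless", 0),
        "undetected": stats.get("undetected", 0),
    }


def fetch_vt_report(sha256: str, api_key: str, timeout: float = 30.0) -> dict:
    """Perform the lookup; requires network access and a valid API key."""
    request = build_vt_request(sha256, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction and report summarization separate from the network call makes both easy to test without an API key.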
Challenges Addressed:
1. Detection of Previously Unknown Malware
Challenge: Traditional malware detection systems rely heavily on signature-based
detection, which can only identify known malware by matching against a database of
existing signatures. This approach is ineffective for new or modified malware with
unique signatures, as it requires prior knowledge of the threat.
Solution: The proposed system uses a machine learning model (Random Forest) trained
on features extracted from both malicious and benign files. This model can generalize
patterns in malware, enabling the system to detect previously unknown threats based on
learned patterns rather than specific signatures. This improves the system’s ability to
identify new variants of malware without the need for a signature update.
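The idea of learning patterns instead of matching signatures can be illustrated with a deliberately simplified ensemble. The sketch below is not the project's Random Forest model and not a library implementation: it bags one-level decision stumps over random feature subsets and takes a majority vote, which is the ensemble principle that Random Forest builds on. It uses a per-class (stratified) bootstrap so that the tiny toy dataset never yields a single-class sample. The two features (file entropy and a count of suspicious imports) and all data values are invented for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Find the single-feature threshold rule with the fewest training errors.

    A rule is (feature, threshold, label_above): predict label_above when the
    feature value exceeds the threshold, otherwise the opposite label.
    """
    best_errors, best_rule = None, None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            for label_above in (1, 0):
                errors = sum(
                    (label_above if row[f] > threshold else 1 - label_above) != y
                    for row, y in zip(rows, labels)
                )
                if best_errors is None or errors < best_errors:
                    best_errors, best_rule = errors, (f, threshold, label_above)
    return best_rule


def predict_stump(rule, row):
    f, threshold, label_above = rule
    return label_above if row[f] > threshold else 1 - label_above


def train_ensemble(rows, labels, n_trees=25, max_features=1, seed=0):
    """Train stumps on per-class bootstrap samples with random feature subsets."""
    rng = random.Random(seed)
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in set(labels)}
    n_features = len(rows[0])
    ensemble = []
    for _ in range(n_trees):
        # Resample with replacement inside each class (stratified bootstrap).
        idx = [rng.choice(members) for members in by_class.values() for _ in members]
        features = rng.sample(range(n_features), max_features)
        ensemble.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], features))
    return ensemble


def predict(ensemble, row):
    """Majority vote over all stumps (1 = malicious, 0 = benign)."""
    votes = Counter(predict_stump(rule, row) for rule in ensemble)
    return votes.most_common(1)[0][0]


# Invented toy data: [entropy, suspicious_import_count]
rows = [[4.8, 0], [5.2, 1], [5.9, 0], [6.1, 1],
        [7.2, 4], [7.6, 6], [7.9, 5], [7.4, 3]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = train_ensemble(rows, labels)
```

A file never seen during training is classified by its feature values alone, which is what allows a trained model to flag a new variant without a signature update.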
2. Resistance to Evasion Techniques
Challenge: Malware authors often employ obfuscation techniques, such as code packing,
encryption, or polymorphism, to hide malicious behavior and evade static detection
methods. These techniques make it difficult for signature-based systems to analyze and
detect such files accurately.
Solution: The proposed system incorporates static analysis, which includes calculating
entropy and examining the Portable Executable (PE) headers. High entropy scores can
indicate obfuscation or packing, which may signal the presence of malware. Additionally,
by combining this static analysis with behavior-based insights from VirusTotal and
machine learning classification, the system enhances its ability to identify malware even
when obfuscation techniques are used.
3. Reducing False Positives and Negatives
Challenge: Traditional detection systems, especially behavior-based ones, can generate
high rates of false positives (labeling benign files as malicious) and false negatives
(failing to detect actual malware). False positives can erode user trust, while false
negatives allow malicious files to go undetected, posing a security risk.
Solution: The proposed system reduces false positives and negatives by combining
multiple detection methods: static analysis, optional dynamic analysis via VirusTotal, and
machine learning classification. The integration of VirusTotal’s feedback provides
behavior-based indicators, while the machine learning model applies learned patterns to
improve classification accuracy. Together, these approaches create a more reliable and
balanced detection mechanism, enhancing overall detection accuracy and minimizing
erroneous classifications.
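One possible way to combine the three detection methods is a simple voting rule, sketched below. The thresholds, the two-signal requirement, and the function name are assumptions made for illustration; they do not describe how this project's implementation weighs its indicators.

```python
def combined_verdict(entropy, ml_probability, vt_malicious=None,
                     entropy_threshold=7.0, ml_threshold=0.5, vt_threshold=3):
    """Combine static, machine learning, and VirusTotal signals into one verdict.

    vt_malicious is the number of engines flagging the file, or None when
    the optional VirusTotal step was skipped.
    """
    signals = {
        "static_high_entropy": entropy > entropy_threshold,
        "ml_malicious": ml_probability >= ml_threshold,
    }
    if vt_malicious is not None:
        signals["virustotal_flagged"] = vt_malicious >= vt_threshold
    positives = sum(signals.values())
    # Require two agreeing signals so that one noisy indicator
    # cannot label a file malicious on its own.
    if positives >= 2:
        verdict = "malicious"
    elif positives == 1:
        verdict = "suspicious"
    else:
        verdict = "benign"
    return {"verdict": verdict, "signals": signals}
```

Requiring agreement between independent signals is one way to trade a small loss in sensitivity for fewer false positives.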
1. Increased Resilience to Evasion: The machine learning model is trained to
recognize patterns of obfuscation and other evasive techniques, making the system
more resilient to sophisticated threats.
2. Scalability: The system can be scaled to handle large volumes of data and adapt to
new types of malware.
4. Lower Resource Consumption: Efficient algorithms will ensure that the system
does not overly burden the system’s resources.
2. Feature Extraction: Analyze the collected executables to extract features that are
indicative of malicious or benign behavior. This could include API calls, file
access patterns, network activity, etc.
3. Model Training: Use the extracted features to train a machine learning model
(e.g., Random Forest). The model will learn to distinguish between malicious and
benign executables based on the features.
5. Evaluation and Optimization: Test the system against a separate dataset to
evaluate its performance. Fine-tune the model and system to improve accuracy and
reduce false positives.
3.5.2 Flowchart of Proposed System
[Flowchart: Start, Feature Analysis, Classification (Legitimate or Malicious), Report Generation, End]
3.5.3 Use Case Diagram of Proposed System
1. Scan File: The user initiates a scan of an executable file.
4. Machine Learning Model: Receives features extracted from static and dynamic
analysis and classifies the sample.
5. View Report: The user views the analysis report generated by the system.
No malware Detected.
3.5.4 Database Structure of the Proposed System
If the program requires a database, it would contain the following tables:
1. Executables
2. Results
3. Models
4. Logs
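The four items listed above could be realized as in the following SQLite sketch. Only the table names come from the list; every column name, type, and relationship is a hypothetical choice made for illustration.

```python
import sqlite3

# Hypothetical schema: only the four table names come from the design above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS executables (
    id INTEGER PRIMARY KEY,
    sha256 TEXT NOT NULL UNIQUE,
    file_name TEXT,
    file_size INTEGER
);
CREATE TABLE IF NOT EXISTS models (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    trained_at TEXT
);
CREATE TABLE IF NOT EXISTS results (
    id INTEGER PRIMARY KEY,
    executable_id INTEGER NOT NULL REFERENCES executables(id),
    model_id INTEGER REFERENCES models(id),
    verdict TEXT NOT NULL,
    analyzed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS logs (
    id INTEGER PRIMARY KEY,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    message TEXT NOT NULL
);
"""


def open_database(path=":memory:"):
    """Open (or create) the database and make sure all tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Storing the SHA-256 hash as a unique key allows a repeated scan of the same file to reuse earlier results.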
2. Business Logic Layer: This is where the core processing happens. It includes the
feature extraction, machine learning model, and analysis logic.
3. Data Layer: If a database is used, it will be managed here. This layer handles the
storage and retrieval of data such as executable features, analysis results, and logs.
CHAPTER FOUR
IMPLEMENTATION
4.3 Output
The script uses print statements to display analysis results, structured in JSON format, for
easier readability and log storage.
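The idea of printing results as JSON can be sketched as follows. The helper name `format_report` and the report contents are hypothetical; only the use of `json.dumps` with `indent=4` mirrors the script in the appendix.

```python
import json


def format_report(report):
    """Serialise an analysis report as indented JSON text."""
    return json.dumps(report, indent=4, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical report contents, for illustration only.
    example_report = {
        "File": "sample.exe",
        "Sections": [{"Name": ".text", "Entropy": 6.1}],
        "Verdict": "No Malware Detected",
    }
    text = format_report(example_report)
    print(text)
    # The same text can be written to a log file and parsed back later.
    assert json.loads(text) == example_report
```

Because the output is valid JSON, a stored log entry can be reloaded with `json.loads` and compared against later scans of the same file.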
Final outputs:
CHAPTER FIVE
5.2 Conclusion
The implementation of static, dynamic, and machine learning-based methods has proven
effective in developing an automated malware detection system. By combining these
techniques, the system can perform comprehensive analysis, offering high detection rates
and low false positives. This layered approach demonstrates the potential of integrating
machine learning with traditional malware detection methods, providing a stronger
defense against evolving cyber threats.
5.3 Recommendations
Future work could consider the following recommendations:
1. Enhance Dataset Scope: Expand the dataset to include a broader range of malware
types and variations for improved detection robustness.
2. Model Improvements: Experiment with additional machine learning models, such
as neural networks, to further optimize detection accuracy.
3. System Scalability: Explore methods to reduce system resource requirements,
ensuring efficient performance even in resource-constrained environments.
4. User Education: Encourage regular updates and provide user guidelines on safe file
handling to complement the system's effectiveness.
REFERENCES
Advanced Malware Analysis, CISA “Cybersecurity and Infrastructure Security Agency
(CISA) - Malware Analysis”
Alazab, M., Venkatraman, S., Watters, P., & Alazab, M. (2011). "Zero-day malware
detection based on supervised learning algorithms of API call signatures." Proceedings of
the Ninth Australasian Data Mining Conference (AusDM 2011), Ballarat, Australia.
Anderson, H. S., & Roth, P. (2018). "Ember: An open dataset for training static PE
malware machine learning models." arXiv preprint arXiv:1804.04637.
Bayer, U., Kruegel, C., & Kirda, E. (2009). "TTAnalyze: A Tool for Analyzing
Malware." Proceedings of the 15th European Conference on Research in Computer
Security (ESORICS 2009).
Bilge, L., & Dumitras, T. (2012). "Before we knew it: An empirical study of zero-day
attacks in the real world." Proceedings of the 2012 ACM conference on Computer and
communications security.
Christodorescu, M., Jha, S., Seshia, S. A., Song, D., & Bryant, R. E. (2005). "Semantics-
aware malware detection." Proceedings of the 2005 IEEE Symposium on Security and
Privacy (S&P).
Dinaburg, A., Royal, P., Sharif, M., & Lee, W. (2008). "Ether: malware analysis via
hardware virtualization extensions." Proceedings of the 15th ACM conference on
Computer and communications security.
Egele, M., Scholte, T., Kirda, E., & Kruegel, C. (2008). "A survey on automated dynamic
malware-analysis techniques and tools." ACM computing surveys (CSUR), 44(2), 1-42.
Faruki, P., Bharmal, A., Laxmi, V., Ganmoor, V., Gaur, M. S., Conti, M., & Rajarajan,
M. (2014). "Android security: A survey of issues, malware penetration, and defenses."
IEEE Communications Surveys & Tutorials, 17(2), 998-1022.
Khan, M. H., & Khan, I. R. (2017). Malware Detection and Analysis. International
Journal of Advanced Research in Computer Science, 8(5), 1147-1149. Retrieved from
https://www.ijarcs.info
Kleber, D., & Rios, D. (2019). Automated Malware Analysis: A Python Approach.
Python Security Handbook.
Namanya, A.P., Pagna-Disso, J.F., & Awan, I. (2015). Evaluation of Automated Static
Analysis Tools for Malware Detection in Portable Executable Files. Proceedings of the
31st UK Performance Engineering Workshop, University of Leeds, UK.
Raff, E., Barker, J., Sylvester, J., Brandon, R., Catanzaro, B., & Nicholas, C. (2020).
"Malware detection by eating a whole EXE." Proceedings of the AAAI Conference on
Artificial Intelligence, 34(01), 772-780.
Talukder, S. (2020). Tools and Techniques for Malware Detection and Analysis.
Retrieved from ResearchGate:
https://www.researchgate.net/publication/339301928_Tools_and_Techniques_for_Malware_Detection_and_Analysis
Vinod, P., Jaipur, R., Laxmi, V., & Gaur, M. S. (2009). "Survey on malware detection
methods." Proceedings of the 3rd Hackers Workshop on Computer and Internet Security,
74-79.
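APPENDIX A: SOURCE CODE
import os
import pefile
import hashlib
import requests
import json
import time
import array
import math
import pickle
import joblib

# Read the VirusTotal API key from the environment rather than hard-coding it.
# The variable name VIRUSTOTAL_API_KEY is an assumed convention.
VIRUSTOTAL_API_KEY = os.environ.get("VIRUSTOTAL_API_KEY", "")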
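def is_executable(file_path):
    # The body of this function is missing from the source listing. The check
    # below is an assumed reconstruction: PE files start with the bytes "MZ".
    try:
        with open(file_path, "rb") as f:
            return f.read(2) == b"MZ"
    except OSError:
        return False


def load_executable(file_path):
    try:
        pe = pefile.PE(file_path)
        return pe
    except pefile.PEFormatError:
        return None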
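def calculate_hash(file_path):
    with open(file_path, "rb") as f:
        file_data = f.read()
    md5_hash = hashlib.md5(file_data).hexdigest()
    sha256_hash = hashlib.sha256(file_data).hexdigest()
    # Return value assumed; the return statement is missing from the source listing.
    return md5_hash, sha256_hash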
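# Only the skeletons of the next two functions survive in the source listing;
# the fields returned below are assumed reconstructions.
def print_basic_info(pe):
    info = {
        "Entry Point": hex(pe.OPTIONAL_HEADER.AddressOfEntryPoint),
        "Image Base": hex(pe.OPTIONAL_HEADER.ImageBase),
    }
    return info


def print_sections(pe):
    return [{"Name": s.Name.decode(errors="replace").rstrip("\x00"),
             "Entropy": s.get_entropy()} for s in pe.sections]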
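def print_imports(pe):
    imports_info = []
    if hasattr(pe, 'DIRECTORY_ENTRY_IMPORT'):
        for entry in pe.DIRECTORY_ENTRY_IMPORT:
            # The "DLL" and "Name" keys are assumed; those lines are missing
            # from the source listing.
            dll_info = {"DLL": entry.dll.decode(errors="replace"), "Functions": []}
            for func in entry.imports:
                dll_info["Functions"].append({
                    "Name": func.name.decode(errors="replace") if func.name else None,
                    "Address": hex(func.address)
                })
            imports_info.append(dll_info)
    return imports_info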
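def submit_to_virustotal(file_path):
    url = "https://www.virustotal.com/vtapi/v2/file/scan"
    # Request construction reconstructed; it is missing from the source listing.
    params = {"apikey": VIRUSTOTAL_API_KEY}
    with open(file_path, "rb") as f:
        response = requests.post(url, files={"file": f}, params=params)
    if response.status_code == 200:
        resource = response.json().get("resource")
        return resource
    return None


def get_virustotal_report(resource):
    url = "https://www.virustotal.com/vtapi/v2/file/report"
    # Request construction reconstructed; it is missing from the source listing.
    params = {"apikey": VIRUSTOTAL_API_KEY, "resource": resource}
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return response.json()
    return None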
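def get_entropy(data):
    # Shannon entropy of a byte string, in bits per byte (0.0 to 8.0).
    if len(data) == 0:
        return 0.0
    occurrences = array.array('L', [0] * 256)
    for x in data:
        occurrences[x if isinstance(x, int) else ord(x)] += 1
    entropy = 0
    for x in occurrences:
        if x:
            p_x = float(x) / len(data)
            entropy -= p_x * math.log(p_x, 2)
    return entropy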
def get_resources(pe):
    # Collect [entropy, size] for every resource in the PE resource directory.
    resources = []
    if hasattr(pe, 'DIRECTORY_ENTRY_RESOURCE'):
        try:
            for resource_type in pe.DIRECTORY_ENTRY_RESOURCE.entries:
                if hasattr(resource_type, 'directory'):
                    for resource_id in resource_type.directory.entries:
                        if hasattr(resource_id, 'directory'):
                            for resource_lang in resource_id.directory.entries:
                                data = pe.get_data(resource_lang.data.struct.OffsetToData,
                                                   resource_lang.data.struct.Size)
                                size = resource_lang.data.struct.Size
                                entropy = get_entropy(data)
                                resources.append([entropy, size])
        except Exception:
            return resources
    return resources
def get_version_info(pe):
    res = {}
    for fileinfo in pe.FileInfo:
        if fileinfo.Key == 'StringFileInfo':
            for st in fileinfo.StringTable:
                for entry in st.entries.items():
                    res[entry[0]] = entry[1]
        if fileinfo.Key == 'VarFileInfo':
            for var in fileinfo.Var:
                # dict.items() is not indexable in Python 3, so wrap it in list().
                key, value = list(var.entry.items())[0]
                res[key] = value
    if hasattr(pe, 'VS_FIXEDFILEINFO'):
        res['flags'] = pe.VS_FIXEDFILEINFO.FileFlags
        res['os'] = pe.VS_FIXEDFILEINFO.FileOS
        res['type'] = pe.VS_FIXEDFILEINFO.FileType
        res['file_version'] = pe.VS_FIXEDFILEINFO.FileVersionLS
        res['product_version'] = pe.VS_FIXEDFILEINFO.ProductVersionLS
        res['signature'] = pe.VS_FIXEDFILEINFO.Signature
        res['struct_version'] = pe.VS_FIXEDFILEINFO.StrucVersion
    return res
def extract_info(fpath):
    res = {}
    try:
        pe = pefile.PE(fpath)
    except pefile.PEFormatError:
        return {}
    res['Machine'] = pe.FILE_HEADER.Machine
    res['SizeOfOptionalHeader'] = pe.FILE_HEADER.SizeOfOptionalHeader
    res['Characteristics'] = pe.FILE_HEADER.Characteristics
    res['MajorLinkerVersion'] = pe.OPTIONAL_HEADER.MajorLinkerVersion
    res['MinorLinkerVersion'] = pe.OPTIONAL_HEADER.MinorLinkerVersion
    res['SizeOfCode'] = pe.OPTIONAL_HEADER.SizeOfCode
    res['SizeOfInitializedData'] = pe.OPTIONAL_HEADER.SizeOfInitializedData
    res['SizeOfUninitializedData'] = pe.OPTIONAL_HEADER.SizeOfUninitializedData
    res['AddressOfEntryPoint'] = pe.OPTIONAL_HEADER.AddressOfEntryPoint
    res['BaseOfCode'] = pe.OPTIONAL_HEADER.BaseOfCode
    try:
        res['BaseOfData'] = pe.OPTIONAL_HEADER.BaseOfData
    except AttributeError:
        res['BaseOfData'] = 0
    res['ImageBase'] = pe.OPTIONAL_HEADER.ImageBase
    res['SectionAlignment'] = pe.OPTIONAL_HEADER.SectionAlignment
    res['FileAlignment'] = pe.OPTIONAL_HEADER.FileAlignment
    res['MajorOperatingSystemVersion'] = pe.OPTIONAL_HEADER.MajorOperatingSystemVersion
    res['MinorOperatingSystemVersion'] = pe.OPTIONAL_HEADER.MinorOperatingSystemVersion
    res['MajorImageVersion'] = pe.OPTIONAL_HEADER.MajorImageVersion
    res['MinorImageVersion'] = pe.OPTIONAL_HEADER.MinorImageVersion
    res['MajorSubsystemVersion'] = pe.OPTIONAL_HEADER.MajorSubsystemVersion
    res['MinorSubsystemVersion'] = pe.OPTIONAL_HEADER.MinorSubsystemVersion
    res['SizeOfImage'] = pe.OPTIONAL_HEADER.SizeOfImage
    res['SizeOfHeaders'] = pe.OPTIONAL_HEADER.SizeOfHeaders
    res['CheckSum'] = pe.OPTIONAL_HEADER.CheckSum
    res['Subsystem'] = pe.OPTIONAL_HEADER.Subsystem
    res['DllCharacteristics'] = pe.OPTIONAL_HEADER.DllCharacteristics
    res['SizeOfStackReserve'] = pe.OPTIONAL_HEADER.SizeOfStackReserve
    res['SizeOfStackCommit'] = pe.OPTIONAL_HEADER.SizeOfStackCommit
    res['SizeOfHeapReserve'] = pe.OPTIONAL_HEADER.SizeOfHeapReserve
    res['SizeOfHeapCommit'] = pe.OPTIONAL_HEADER.SizeOfHeapCommit
    res['LoaderFlags'] = pe.OPTIONAL_HEADER.LoaderFlags
    res['NumberOfRvaAndSizes'] = pe.OPTIONAL_HEADER.NumberOfRvaAndSizes
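    # Sections
    res['SectionsNb'] = len(pe.sections)
    entropy = list(map(lambda x: x.get_entropy(), pe.sections))
    res['SectionsMeanEntropy'] = sum(entropy) / float(len(entropy))
    res['SectionsMinEntropy'] = min(entropy)
    res['SectionsMaxEntropy'] = max(entropy)
    raw_sizes = list(map(lambda x: x.SizeOfRawData, pe.sections))
    res['SectionsMeanRawsize'] = sum(raw_sizes) / float(len(raw_sizes))
    res['SectionsMinRawsize'] = min(raw_sizes)
    res['SectionsMaxRawsize'] = max(raw_sizes)
    virtual_sizes = list(map(lambda x: x.Misc_VirtualSize, pe.sections))
    res['SectionsMeanVirtualsize'] = sum(virtual_sizes) / float(len(virtual_sizes))
    res['SectionsMinVirtualsize'] = min(virtual_sizes)
    # Key spelling ('Section', not 'Sections') kept as in the source listing.
    res['SectionMaxVirtualsize'] = max(virtual_sizes)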
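    # Imports
    try:
        res['ImportsNbDLL'] = len(pe.DIRECTORY_ENTRY_IMPORT)
        imports = sum([x.imports for x in pe.DIRECTORY_ENTRY_IMPORT], [])
        res['ImportsNb'] = len(imports)
        # Imports without a name are imported by ordinal.
        res['ImportsNbOrdinal'] = len([x for x in imports if x.name is None])
    except AttributeError:
        res['ImportsNbDLL'] = 0
        res['ImportsNb'] = 0
        res['ImportsNbOrdinal'] = 0
    # Exports
    try:
        res['ExportNb'] = len(pe.DIRECTORY_ENTRY_EXPORT.symbols)
    except AttributeError:
        # No export
        res['ExportNb'] = 0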
    # Resources
    resources = get_resources(pe)
    res['ResourcesNb'] = len(resources)
    if len(resources) > 0:
        entropy = list(map(lambda x: x[0], resources))
        res['ResourcesMeanEntropy'] = sum(entropy) / float(len(entropy))
        res['ResourcesMinEntropy'] = min(entropy)
        res['ResourcesMaxEntropy'] = max(entropy)
        sizes = list(map(lambda x: x[1], resources))
        res['ResourcesMeanSize'] = sum(sizes) / float(len(sizes))
        res['ResourcesMinSize'] = min(sizes)
        res['ResourcesMaxSize'] = max(sizes)
    else:
        res['ResourcesNb'] = 0
        res['ResourcesMeanEntropy'] = 0
        res['ResourcesMinEntropy'] = 0
        res['ResourcesMaxEntropy'] = 0
        res['ResourcesMeanSize'] = 0
        res['ResourcesMinSize'] = 0
        res['ResourcesMaxSize'] = 0
    # Load configuration size
    try:
        res['LoadConfigurationSize'] = pe.DIRECTORY_ENTRY_LOAD_CONFIG.struct.Size
    except AttributeError:
        res['LoadConfigurationSize'] = 0
    # Version information size
    try:
        version_info = get_version_info(pe)
        res['VersionInformationSize'] = len(version_info.keys())
    except AttributeError:
        res['VersionInformationSize'] = 0
    return res
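def model_predict(file_path):
    model = joblib.load("model/model.pkl")
    # Feature names saved by the training script; this load is reconstructed,
    # as the line is missing from the source listing.
    with open("model/features.pkl", "rb") as f:
        features = pickle.load(f)
    data = extract_info(file_path)
    if data:
        pe_features = list(map(lambda x: data[x], features))
        result = model.predict([pe_features])[0]
        return result
    # Assumed: return None when no features could be extracted.
    return None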
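# Complete Analysis
# Several lines of this function are missing from the source listing; lines
# marked "assumed" are reconstructions, not the original code.
def analyze_file(file_path):
    if is_executable(file_path):
        pe = load_executable(file_path)
        if pe:
            static_report = {
                "Sections": print_sections(pe),
                "Imports": print_imports(pe)
            }
            print(json.dumps(static_report, indent=4))

            print("\n[+] Performing Dynamic Analysis with VirusTotal...")
            positives = 0
            resource = submit_to_virustotal(file_path)
            dynamic_report = get_virustotal_report(resource) if resource else None  # assumed
            if dynamic_report:
                positives = dynamic_report.get("positives", 0)
                print(json.dumps(dynamic_report, indent=4))
            else:
                print("[-] No VirusTotal report available.")  # assumed

            model_result = model_predict(file_path)

            # Final Summary
            # Assumed rule: label 0 in the 'legitimate' column means malware.
            if positives > 0 or model_result == 0:
                print("Final Result: Malware Detected")
            else:
                print("Final Result: No Malware Detected")
        else:
            print("[-] Could not parse the file as a PE executable.")  # assumed
    else:
        print("[-] The file is not an executable.")  # assumed


def main():
    # Assumed: how the path is obtained is not shown in the source listing.
    file_path = input("Enter the path of the file to analyze: ").strip()
    analyze_file(file_path)


if __name__ == "__main__":
    main()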
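# Model training script
import pickle
import joblib
import numpy
import pandas
import sklearn.ensemble as ek
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# The dataset-loading line is missing from the source listing; the file name
# and separator below are assumptions.
dataset = pandas.read_csv("data.csv", sep="|")
# dataset.head()
# dataset.describe()
# dataset.groupby(dataset['legitimate']).size()

# data preprocessing
# Assumed: the first two columns are identifiers and 'legitimate' is the last
# column, which is consistent with the "2 + f" column offset used below.
X = dataset.iloc[:, 2:].drop(columns=['legitimate']).values
y = dataset['legitimate'].values

# Feature selection with an extra-trees classifier
extratrees = ek.ExtraTreesClassifier().fit(X, y)
model = SelectFromModel(extratrees, prefit=True)
X_new = model.transform(X)
nbfeatures = X_new.shape[1]
# print(nbfeatures)
# Split ratio assumed; the split line is missing from the source listing.
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2)

features = []
index = numpy.argsort(extratrees.feature_importances_)[::-1][:nbfeatures]
# Store the names of the selected columns in the same order as the columns of
# X_new. The listing looped over range(nbfeatures), which would name the first
# nbfeatures columns rather than the selected ones.
for f in model.get_support(indices=True):
    features.append(dataset.columns[2 + f])

model = ek.RandomForestClassifier(n_estimators=33)
model.fit(X_train, y_train)
joblib.dump(model, "model/model.pkl")
with open('model/features.pkl', 'wb') as out_file:
    out_file.write(pickle.dumps(features))

# False Positives and Negatives
# Note: as in the source listing, this scores the full dataset, training rows included.
res = model.predict(X_new)
mt = confusion_matrix(y, res)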
APPENDIX B: SCREENSHOT OF OUTPUT
Normal exe
Malware Detected