Final Project Report
PREVENTION USING ML
A PROJECT REPORT
Submitted by
ABINESH.S(721020104003)
KESAAV.NA(721020104024)
KIRUBAKARAN.V(721020104026)
of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
MAY 2024
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
Dr. S. Pathur Nisha, M.E., Ph.D., Dr.D. Sathish Kumar, M.E., Ph.D.,
Coimbatore 641105
Coimbatore 641105
ABINESH.S(721020104003)
KESAAV.NA(721020104024)
KIRUBAKARAN.V(721020104026)
ACKNOWLEDGEMENT
First of all, we thank the Almighty for giving us the knowledge and courage to complete this
dissertation work successfully. We express our sincere gratitude to Dr. P. Krishna
Kumar, MBA., Ph.D., CEO & Secretary, for providing us the opportunity to carry out our
undergraduate program in this reputed institution and for the support shown towards us
throughout the course. We would also like to thank our respected Principal, Dr. M.
Sivaraja, M.E., Ph.D., P.D. (USA), for being a source of inspiration throughout the
course.
We would also like to extend our sincerest gratitude to Dr. S. Pathur Nisha, M.E., Ph.D.,
Head of the Department, Department of Computer Science and Engineering, for her
constant motivation and encouragement to carry out the project work in a
spectacular fashion. We also thank all the faculty members of our department for their
timely support and generous help in the accomplishment of our
work.
We would like to thank our project guide, Dr.D. Sathish Kumar, M.E., Ph.D.,
Professor, Department of Computer Science and Engineering, for her encouragement and
support. We also acknowledge the valuable support of our lab technicians, who extended
a helping hand whenever it was required.
Finally, we thank our parents and friends for their constant encouragement during our
college years.
ABSTRACT
With rapid technological advancement, security has become a major issue due to the
increase in malware activity that poses a serious threat to the security and safety of both
computer systems and stakeholders. To maintain the security of stakeholders, particularly
end users, protecting data from fraudulent efforts is one of the most pressing concerns.
A set of malicious programming code, scripts, active content, or intrusive software that is
designed to destroy intended computer systems and programs or mobile and web
distinguish between malicious and benign applications. Thus, computer systems and
utilizing novel concepts including Artificial Intelligence, Machine Learning, and Deep
Learning. In this study, we emphasize Machine Learning based techniques for detecting
detection technologies, their shortcomings, and ways to improve efficiency. Our study
shows that adopting futuristic approaches for the development of malware detection
CHAPTER 1
INTRODUCTION
1.1 DOMAIN
Cybersecurity, also known as information technology security or simply IT security, is the
practice of protecting computer systems, networks, and data from theft, damage, or
unauthorized access. It encompasses a range of technologies, processes, and practices
designed to ensure the confidentiality, integrity, and availability of information in the
digital realm. The importance of cybersecurity has grown significantly with the increasing
reliance on digital technologies in all aspects of society, including businesses,
governments, and individuals. As more data is stored and transmitted electronically, the
potential risks and threats to this data have also multiplied.
1.2 MACHINE LEARNING TECHNIQUES
Machine learning techniques fall into several broad categories. Classification assigns
labels to input data based on learning from labeled examples; popular algorithms include
Support Vector Machines (SVM), Decision Trees, Random Forests, and Neural Networks.
Regression predicts continuous outcomes based on input data; Linear Regression,
Polynomial Regression, Support Vector Regression, and Neural Networks are common
regression techniques. Clustering groups similar data points together based on some
similarity metric; K-Means, Hierarchical Clustering, and DBSCAN are popular clustering
algorithms. Dimensionality reduction reduces the number of features in the data while
preserving most of its variance; Principal Component Analysis (PCA) and t-distributed
Stochastic Neighbor Embedding (t-SNE) are widely used for this purpose. Semi-supervised
learning combines both labeled and unlabeled data for training, which can be particularly
useful when labeled data is scarce or expensive to obtain.
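To make the idea of classification concrete, the following is a minimal sketch of a nearest-centroid classifier written with the Python standard library only. It is not the model used in this project; the function names, the two features, and the sample values are all hypothetical and chosen only for illustration.

```python
from math import dist
from statistics import mean

def fit_centroids(samples, labels):
    """Return one mean feature vector (centroid) per class label."""
    grouped = {}
    for features, label in zip(samples, labels):
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Assign the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical two-feature samples: (files written per minute, mean entropy).
train_x = [(2, 3.1), (3, 2.8), (40, 7.6), (55, 7.9)]
train_y = ["benign", "benign", "ransomware", "ransomware"]
centroids = fit_centroids(train_x, train_y)
```

Calling `predict(centroids, (50, 7.5))` assigns the new sample to whichever class centroid lies nearest, which is the essence of learning labels from labeled examples.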
CHAPTER 2
LITERATURE REVIEW
2.1 INTRODUCTION
Preventing ransomware is challenging for several reasons. Ransomware often behaves
much like benign software while acting covertly, so detecting ransomware in
zero-day attacks is crucial. The primary objectives are to avoid
ransomware-caused system damage, to identify zero-day malware, and to minimize detection
errors, which means reducing the number of false positives while still detecting all
instances of ransomware. False positives are instances where the system flags a harmless
program or file as ransomware, leading to unnecessary alerts and actions. Ransomware can
be found using a variety of tools and methodologies. Methods based on static analysis
inspect a program's code without running it. They generate many false positives and
cannot find ransomware that is disguised.
Attackers frequently create new variants and modify their code using various
packing techniques. To solve these issues, researchers use dynamic behavior analysis
methods that monitor interactions between the executed code and a virtual environment.
However, these detection methods are cumbersome and memory-intensive. Machine
learning is well suited to analyzing the behavior of processes or applications because it
can effectively learn patterns and anomalies in large datasets, which can be difficult for
humans to detect.
CONCLUSION
Malware or malicious applications may cause catastrophic damage not only to computer
systems but also to data centers and web and mobile applications across various industries,
particularly financial and healthcare institutions. Ensuring the safety of stakeholders' data
from malicious entities is a major challenge that leads us towards the concept of malware
detection and prevention. Machine Learning (ML) can be an effective solution that we can
adopt for the development of anti-malware systems. With this direction, this study
presented a detailed review of malware detection techniques and approaches. At first, we
attempted to provide a clear overview of malware, artificial intelligence, and its narration.
CHAPTER 3
SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
Traditional antivirus software can detect known ransomware strains based on
signatures or behavioral patterns. However, it may struggle with new or evolving variants.
Endpoint Detection and Response (EDR) solutions monitor endpoint devices for suspicious
activities and behaviors indicative of ransomware. They can detect anomalies in file access
patterns, process behaviors, and network communications. Next-Generation Antivirus
(NGAV) solutions use advanced techniques such as machine learning, artificial
intelligence, and behavioral analysis to identify and block ransomware threats in real
time. They can detect both known and unknown ransomware variants.
Regular backups of critical data are essential for ransomware protection. Backup solutions
with features like versioning, encryption, and off-site storage can help organizations
recover from ransomware attacks without paying the ransom.
Train machine learning models on labeled datasets to classify normal and malicious
behavior. Use supervised learning algorithms like Random Forest, Support Vector
Machines (SVM), or deep learning models such as Convolutional Neural Networks
(CNNs) or Recurrent Neural Networks (RNNs). Continuously update and refine the
models with new data to improve detection accuracy and adapt to evolving ransomware
variants.
3.1.2 ADVANTAGES
• Early Detection, Reduced False Positives, Automated Response
• High Security.
CHAPTER 4
Ransomware Detection
4.1 Ransomware-Detection Methods
4.1.1 Description
The two main types of ransomware-detection methods are automated and manual.
Automated methods rely on tools that identify and report ransomware attacks. These
tools are typically software programs that can potentially stop attacks. Manual
detection techniques focus on routinely scanning data and devices for indicators of
attacks. This includes checking whether malware has modified data or prevented
authorized users from accessing their devices or files, for example by looking for
changes to file extensions and verifying that devices and files remain accessible to
authorized users.
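The manual check for changed file extensions described above can be partly scripted. The sketch below is only an illustration under stated assumptions: the extension watch list and both function names are hypothetical, and a real check would use a list maintained from current threat intelligence.

```python
import os

# Assumed examples of extensions appended by ransomware; not an exhaustive list.
SUSPICIOUS_EXTENSIONS = {".locked", ".encrypted", ".crypt"}

def has_suspicious_extension(filename, extensions=SUSPICIOUS_EXTENSIONS):
    """Return True when the file's final extension is in the watch list."""
    _base, ext = os.path.splitext(filename)
    return ext.lower() in extensions

def find_suspicious_files(root, extensions=SUSPICIOUS_EXTENSIONS):
    """Walk a directory tree and list files whose extension looks suspicious."""
    hits = []
    for folder, _subdirs, filenames in os.walk(root):
        for name in filenames:
            if has_suspicious_extension(name, extensions):
                hits.append(os.path.join(folder, name))
    return sorted(hits)
```

A file such as `photo.jpg.locked` would be reported because only the final extension is compared, which matches the common pattern of ransomware appending a new suffix to an existing name.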
The current methods for detecting ransomware primarily involve monitoring the
system at the file system level. Automated approaches to detecting ransomware can be
categorized into two main groups: those based on machine learning (ML) and those that
are not. ML-based methods typically employ classical machine learning, deep
learning (DL), and artificial neural network (ANN) techniques to detect ransomware.
Some tools utilize variations of these techniques or a hybrid approach that combines two
or more techniques to combat the threat of ransomware attacks. Non-ML methods rely on
packet inspection and traffic analysis to detect ransomware. One of the major advantages
of automated approaches is their ability to detect, block, and recover from ransomware
attacks without human intervention. Additionally, these tools can be highly accurate and
reliable in detecting, preventing, and recovering from ransomware attacks.
ML techniques, including classical machine learning, deep learning, and
artificial neural networks, have been utilized for automated ransomware detection. These
techniques draw on behavioral analysis, as well as static and dynamic
analysis, to identify and prevent ransomware attacks. Machine learning algorithms can
learn from previous ransomware attacks and detect new variants by analyzing patterns and
behaviors. On the other hand, deep learning methods can leverage neural networks to
detect ransomware attacks by analyzing large amounts of data. Artificial neural networks
can also be used to identify ransomware by processing and analyzing multiple data
sources. These ML-based approaches offer a more efficient and reliable way to detect and
prevent ransomware attacks, reducing the potential impact on businesses and individuals.
ML-based detection has several benefits, including its ability to detect new or
unknown ransomware variants that do not match existing signatures or patterns and to
adapt to changing ransomware behavior patterns over time. Moreover, this approach is
less prone to false positives than signature-based and heuristic-based detection, as it relies
on detecting actual behavior patterns rather than static code signatures or predefined rules.
However, machine-learning-based detection is limited by its reliance on a large and
representative dataset of training samples and by its susceptibility to adversarial attacks
that can manipulate the features or behavior of the ransomware to evade detection.
4.2 Ransomware-Detection Techniques
4.2.1 Description
Ransomware detection is a critical component of cybersecurity, and various techniques
have been developed to detect ransomware attacks. This section will discuss different
ransomware-detection techniques proposed in the literature and their strengths,
weaknesses, and limitations.
Heuristic-based detection is a more advanced approach that identifies ransomware behavior
patterns or anomalies indicative of malicious activity. This approach is based on creating
rules or heuristics that describe typical ransomware behavior and then monitoring the
system or network for activity that matches or deviates from these rules. If such variations
or abnormalities are detected, the program is flagged as suspicious or malicious, and
appropriate action is taken. One of the advantages of heuristic-based detection is its
ability to detect new or unknown ransomware variants that do not match any existing
signatures or patterns. Moreover, this approach is less prone to false positives than
signature-based detection, as it relies on detecting actual behavior patterns rather than
static code signatures. However, heuristic-based detection is limited by its reliance on
predefined rules or heuristics, which may capture only some of the possible ransomware
behavior patterns or anomalies. Moreover, attackers can evade heuristic-based detection
by modifying the behavior of the ransomware.
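A heuristic rule of this kind can be sketched in a few lines. The example below flags any process that renames an unusually large number of files within a short time window. The event format, the function name, and the thresholds (20 renames in 10 seconds) are assumptions made for illustration, not values taken from this project.

```python
from collections import defaultdict

def flag_processes(events, max_renames=20, window_seconds=10.0):
    """Flag processes that rename more than max_renames files within any
    interval of window_seconds. Each event is (timestamp, process, action)."""
    rename_times = defaultdict(list)
    for timestamp, process, action in sorted(events):
        if action == "rename":
            rename_times[process].append(timestamp)

    flagged = set()
    for process, times in rename_times.items():
        start = 0
        for end, current in enumerate(times):
            # Advance the window start until it spans at most window_seconds.
            while current - times[start] > window_seconds:
                start += 1
            if end - start + 1 > max_renames:
                flagged.add(process)
                break
    return flagged
```

The sketch also shows the weakness noted above: ransomware that renames files slowly enough to stay under the threshold would not be flagged by this rule.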
This is typically conducted prior to deploying the executable file on a production system.
The confusion between static and dynamic analysis may arise from the fact that both
approaches involve the analysis of executable files, but they do so in different ways. Static
analysis involves examining the executable file's code and structure without running it to
spot malicious activity, while dynamic analysis involves running the executable file in a
controlled environment
to observe its behavior. Dynamic analysis can be performed in real-time, but it can also be
conducted in a sandbox environment before deploying the executable file on a production
system. In a sandbox environment, the executable file is executed in a controlled
environment, allowing its behavior to be monitored and analyzed without affecting the
production system. Once the analysis is complete, the results can be used to determine
whether the executable file is malicious or benign.
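As a small illustration of static analysis, the sketch below computes the Shannon entropy of a file's bytes without executing it. Entropy is not named in this report as a feature; it is used here as an assumed example because packed or encrypted content tends to have entropy close to 8 bits per byte, while plain text is much lower.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

In practice the bytes would be read from the file under inspection, for example with `open(path, "rb").read()`, and the resulting value would be one feature among many rather than a verdict on its own.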
4.4 Performance Evaluation of Machine Learning Models for Ransomware Detection
4.4.1 Description
Evaluating the performance of machine learning models for ransomware detection
is crucial to determine their effectiveness in detecting and preventing its spread. In this
section, we will discuss different evaluation metrics used for measuring the performance
of machine learning models for ransomware detection, including accuracy, precision,
recall, F1-score, and ROC curve.
Recall measures the proportion of actual positive samples that the model correctly
identifies. It is computed as the ratio of true positives to the sum of true positives and
false negatives. A high recall score suggests that the model has a low incidence of false
negatives, which makes it less likely to fail to detect actual ransomware samples.
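The metrics listed in this section follow directly from the four confusion-matrix counts. The helper below is a minimal sketch; the function name is hypothetical, and the convention of returning 0.0 when a denominator is zero is an assumption made to keep the example safe to run.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For example, with 8 true positives and 5 false negatives the recall is 8 / 13, meaning that 5 of the 13 actual ransomware samples were missed.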
4.5 Hybrid Detection
4.5.1 Description
4.6.1 Description
In this project, the process begins with collecting data from various sources within
the organization's IT infrastructure, including endpoint devices, network traffic, system
logs, and security event logs. This data serves as input for ML-driven analysis. Before
feeding the data into ML algorithms, preprocessing steps may be required to clean and
prepare the data for analysis. This could involve tasks such as data normalization, feature
scaling, and handling missing values.
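Two of the preprocessing tasks mentioned above, handling missing values and feature scaling, can be sketched for a single numeric column as follows. Mean imputation and min-max scaling are assumed choices for illustration; the report does not state which methods were used.

```python
def impute_and_scale(column):
    """Replace None with the column mean, then min-max scale to [0, 1]."""
    present = [v for v in column if v is not None]
    if not present:
        return [0.0 for _ in column]
    mean = sum(present) / len(present)
    filled = [mean if v is None else v for v in column]
    low, high = min(filled), max(filled)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in filled]
    return [(v - low) / (high - low) for v in filled]
```

Scaling matters mainly for distance-based and gradient-based models; tree-based models such as Random Forest are largely insensitive to it.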
Supervised learning models are trained using labeled datasets that contain examples
of normal and malicious behavior. The models learn to classify new data instances as
either benign or potentially malicious based on the patterns they've learned during
training. Unsupervised learning techniques may also be employed for anomaly detection.
Anomaly detection techniques are used to identify deviations from normal behavior
that may indicate ransomware activity. ML algorithms detect anomalies by comparing
current data patterns to historical norms or by learning patterns from unlabeled data.
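Comparing current behavior to historical norms can be illustrated with a simple z-score test. This is a sketch under stated assumptions: the function name and the threshold of three standard deviations are hypothetical, and a deployed system would typically use a richer model than a single statistic.

```python
from statistics import mean, pstdev

def is_anomalous(history, current, threshold=3.0):
    """Return True when current deviates from the historical mean by more
    than threshold standard deviations."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # With no historical variation, any different value is unusual.
        return current != baseline
    return abs(current - baseline) / spread > threshold
```

Here `history` could hold, for instance, the number of files a process modified per minute over the past hours, and `current` the most recent count.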
The rise of ransomware since it first appeared in 1989 is attributed to many different
factors. The emergence of ransomware as a service has also increased the availability of
ransomware to potential criminals who are less technically skilled. CryptoLocker,
CryptoWall, and Locky offer this type of service, with the CryptoWall variant generating
more than 320 million dollars in revenue during its lifespan.
5.4.4 RANSOMWARE METHODOLOGY
FIG.5.4.4 RANSOMWARE METHODOLOGY
Installation occurs after the payload has been dropped into the system. One prominent
method of installation is the download dropper. This approach uses a small initial file
whose code is designed to evade detection and reach out to the command and
control (C&C) center. Ransomware authors will attempt to split execution into different
scripts and processes to avoid AV (Anti-Virus) signature-based detection. When an
organization is targeted in an attack, ransomware will spread through the network,
determining file share locations and infecting them to maximize disruption and increase
the possible ransom. The executables will not run until multiple machines have been
infected.
CHAPTER 6
SYSTEM IMPLEMENTATIONS
6.1 SYSTEM IMPLEMENTATION
Ransomware, like most malware, progresses through several phases. QRadar can spot known
and unknown ransomware across these phases. Early detection can help prevent damage
done in later phases. QRadar provides content extensions that include hundreds of use cases
to generate alerts across these phases. Content extensions are delivered through the App
Exchange and provide the ability to get the latest use cases. IBM Security® X-Force®
Threat Intelligence collections are used as references in use cases to help find the latest
known indicators of compromise (IOC), such as IP addresses, malware file hashes, URLs
and more.
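Matching file hashes against known indicators of compromise can be sketched generically as below. This is not the QRadar or X-Force API; both function names are hypothetical, and the set of known-bad hashes is assumed to come from whatever threat-intelligence feed is available.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so that large files do not exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_known_hash(file_hash, known_bad_hashes):
    """Return True when the hash appears in the indicator set (case-insensitive)."""
    return file_hash.lower() in {h.lower() for h in known_bad_hashes}
```

Hash matching only catches samples that have been seen before, which is why it complements rather than replaces the behavioral and ML-based detection discussed elsewhere in this report.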
Initial Access
Execution, Persistence
Discovery, Lateral Movement, Collection
Exfiltration, Impact
Initial Access
The ransomware scans the machine to determine the administrative rights it can
obtain, makes itself run at boot, disables recovery mode, deletes shadow copies, and more.
Execution, Persistence
This is the moment the stopwatch starts. Ransomware is now in your environment. If the
ransomware used a “dropper” to avoid detection in the distribution phase, this is when the
dropper calls home and downloads the "real executable" and runs it.
Discovery, Lateral Movement, Collection
Now that ransomware owns the machine from the starting phase, it will begin a phase of
reconnaissance of the network (attack paths), folders and files with predefined extensions,
and others.
Exfiltration, Impact
The real damage begins now. Typical actions include: create a copy of each file, encrypt
the copies, place the new files at the original location. The original files might be
exfiltrated and deleted from the system, which allows the attackers to extort the victim
with threats of making their breach public, or even to leak stolen documents.
Gather data from diverse sources such as endpoint devices, network traffic, system logs,
and security event logs. Ensure the collected data is representative of normal and
malicious activities.
Clean the data to remove noise, handle missing values, and standardize formats. Perform
tasks such as data normalization, feature scaling, and feature engineering to prepare the
data for analysis.
Extract relevant features from the preprocessed data that can help differentiate between
normal and ransomware activities. Features may include file access patterns, process
behaviors, network traffic characteristics, and system resource usage.
Annotate the dataset with labels indicating whether each data instance represents normal
behavior or ransomware activity. This labeled dataset will be used for training the machine
learning models.
Choose appropriate machine learning algorithms based on the problem at hand and the
characteristics of the data. Common ML algorithms for ransomware detection include
supervised learning algorithms
Step 7: Model Training:
Split the labeled dataset into training and testing sets. Train the selected machine learning
models on the training set using appropriate techniques and algorithms. Validate the
trained models using the testing set to ensure generalization and robustness.
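The split into training and testing sets can be sketched with the standard library alone. The function name, the 20% test fraction, and the fixed seed are assumptions for illustration; scikit-learn's `train_test_split` in `sklearn.model_selection` provides the same operation with more options.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]
```

Fixing the seed makes the split reproducible, so that reported metrics can be compared across runs of the same experiment.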
Evaluate the performance of the trained models using relevant evaluation metrics such as
accuracy, precision, recall, F1-score, and ROC-AUC.
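ROC-AUC can be read as the probability that a randomly chosen ransomware sample receives a higher score than a randomly chosen benign one. The sketch below computes it directly from that definition; the function name is hypothetical, labels are assumed to be 1 for ransomware and 0 for benign, and the pairwise loop is only practical for small evaluation sets.

```python
def roc_auc(labels, scores):
    """Estimate ROC-AUC as the fraction of (positive, negative) pairs in which
    the positive sample scores higher; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 1.0 means every ransomware sample is ranked above every benign one, while 0.5 corresponds to ranking no better than chance.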
Deploy the trained machine learning models into the production environment for real-time
ransomware detection. Integrate the models with existing security infrastructure such as
endpoint protection systems, network intrusion detection systems (NIDS), and Security
Information and Event Management (SIEM) platforms.
Continuously monitor incoming data streams in real-time using the deployed machine
learning models. Analyze data patterns and identify anomalies indicative of ransomware
activity.
CHAPTER 7
CONCLUSION AND FUTURE ENHANCEMENT
7.1 CONCLUSION
Data quality and quantity: A vast amount of high-quality data is needed to train
machine learning models effectively. However, obtaining high-quality data for
ransomware detection is challenging due to the limited availability of labeled ransomware
samples.
Rapidly evolving ransomware: Ransomware is a constantly changing threat, with
new variants and attack techniques being developed regularly. This makes it challenging
to build machine learning models that can detect all ransomware accurately and quickly.
Preprocessing data for ransomware detection also presents several challenges.
Developing more robust and accurate models: Researchers must build more robust
and precise machine learning models that detect a wide range of ransomware variants and
attack techniques. This can be achieved through advanced techniques such as deep
learning and ensemble learning.
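import os
import pickle
import warnings
from sys import getsizeof

import numpy as np
import pandas as pd
import pefile
import matplotlib.pyplot as plt
import sklearn.ensemble as ek
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

# Load the pipe-separated dataset and record its in-memory size in GiB.
df = pd.read_csv("Ransomware.csv", sep='|')
initial_size = getsizeof(df) / (1024.0**3)

# Treat the label column as categorical and plot the class balance.
df.legitimate = df.legitimate.astype('category')
df.legitimate.value_counts().plot(kind='bar')
plt.show()

# Basic inspection: unique hashes, row count, column count, schema.
print(df.md5.nunique(), df.md5.shape[0], df.shape[1])
print(df.columns)
print(df.dtypes)

# Assumption: the split was lost from this listing; numeric features and a
# 20% test set are illustrative choices, not the original settings.
X = df.drop(columns=['md5', 'legitimate']).select_dtypes(include='number')
y = df.legitimate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_test.shape[0] + X_train.shape[0])

i = 0  # index of the test sample to inspect
print(X_test.iloc[i])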
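import os
import getpass
from cryptography.fernet import Fernet

# Generate a symmetric key for this demonstration run.
key = Fernet.generate_key()

# Work on the current user's Desktop (Windows path layout).
username = getpass.getuser()
url = 'C:\\Users\\' + username + '\\Desktop'
print(url)
os.chdir(url)
print(os.getcwd())

# Create a harmless demo file to encrypt.
with open("demo.txt", 'w') as f:
    f.write("hello world")

# Assumption: the prompt text and the file names below were lost from this
# listing and are filled in for illustration only.
getPermission = input("Encrypt demo.txt? (yes/no): ").lower()
if 'yes' in getPermission:
    # Save the key so that the demo file can be decrypted later.
    with open("filekey.key", 'wb') as filekey:
        filekey.write(key)
    with open("demo.txt", 'r') as file:
        original = file.read()
    fernet = Fernet(key)
    encrypted = fernet.encrypt(original.encode())
    with open("demo.txt", 'wb') as encrypted_file:
        encrypted_file.write(encrypted)
else:
    print('thank you')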