42 - Machine Learning Techniques For Cyber Attacks Detection
Master
of
Computer Application
Submitted by
STUDENT_NAME
ROLL_NO
Under the esteemed guidance of
GUIDE_NAME
Assistant Professor
CERTIFICATE
This is to certify that the project report entitled “PROJECT NAME” is the bona fide record
of project work carried out by STUDENT NAME, a student of this college, during the
academic year 2014 - 2016, in partial fulfillment of the requirements for the award of the
degree of Master of Computer Application from St. Mary's Group of Institutions Guntur of
Jawaharlal Nehru Technological University, Kakinada.
GUIDE_NAME,
Asst. Professor (Project Guide)
Associate Professor (Head of Department, CSE)
DECLARATION
We hereby declare that the project report entitled “PROJECT_NAME” is an original work done at
St. Mary's Group of Institutions Guntur, Chebrolu, Guntur and submitted in fulfillment of the
requirements for the award of Master of Computer Application, to St. Mary's Group of Institutions
Guntur, Chebrolu, Guntur.
STUDENT_NAME
ROLL_NO
ACKNOWLEDGEMENT
We consider it a privilege to thank all those who helped us in the successful
completion of the project “PROJECT_NAME”. We extend special gratitude to our guide
GUIDE_NAME, Asst. Professor, whose stimulating suggestions and encouragement
helped us to coordinate our project, especially in writing this report, and whose
valuable guidance and comprehensive assistance helped us greatly in presenting
the project.
We would also like to acknowledge with much appreciation the crucial role of our
Co-Ordinator GUIDE_NAME, Asst. Professor, in helping us complete our project.
Thank you for being such a wonderful educator as well as a person.
We express our heartfelt thanks to HOD_NAME, Head of the Department, CSE, for his
spontaneous expression of knowledge, which helped us in bringing up this project
through the academic year.
STUDENT_NAME
ROLL_NO
ABSTRACT
Bot detection using machine learning (ML), with network flow-level features, has
been extensively studied in the literature. However, existing flow-based
approaches typically incur a high computational overhead and do not completely
capture the network communication patterns, which can expose additional aspects
of malicious hosts. Recently, bot detection systems that leverage communication
graph analysis using ML have gained attention to overcome these limitations.
TABLE OF CONTENTS
CHAPTER NO TITLE
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 OVERVIEW
1.2 OBJECTIVE
1.3 SCOPE
2 LITERATURE SURVEY
3 METHODOLOGY
3.1 EXISTING SYSTEM DISADVANTAGES
3.2 PROPOSED WORK
3.4.1 K-MEANS
3.5 ADVANTAGES
3.6 SOFTWARE AND HARDWARE
3.7 SYSTEM STUDY
3.7.1 ECONOMICAL FEASIBILITY
3.7.2 TECHNICAL FEASIBILITY
3.9.7 COLLABORATION DIAGRAM
3.9.8 COMPONENT DIAGRAM
3.9.9 DEPLOYMENT DIAGRAM
3.10 MODULES
3.10.1 ALGORITHM
REFERENCE
APPENDICES
A. SCREENSHOTS
B. SOURCE CODE
C. PLAGIARISM REPORT
LIST OF FIGURES
FIGURE NO NAME OF THE FIGURE
1 SYSTEM ARCHITECTURE
4 CLASS DIAGRAM
5 OBJECT DIAGRAM
6 STATE DIAGRAM
7 ACTIVITY DIAGRAM
8 SEQUENCE DIAGRAM
9 COLLABORATION DIAGRAM
10 COMPONENT DIAGRAM
11 DEPLOYMENT DIAGRAM
LIST OF ABBREVIATIONS
PD Pandas
NX Networkx
TTK Tkinter
CHAPTER 1
INTRODUCTION
Nowadays most people store their information on their own systems, which makes
securing those systems a problem. At the same time, cyber-attacks are increasing
rapidly and can expose personal data such as photos, social media accounts and
chats. Bot attacks have increased worldwide. Servers holding the data of several
lakh people are also being hacked, and hacking one such server is equivalent to
hacking the data of all those people.
A botnet is a collection of internet-connected devices used for cyber-attacks;
each of these devices is called a bot. Using these bots, an attacker can even
compromise large servers. Together the bots are called a bot army. Because of this
army, a botnet can make time-consuming tasks easier. Although botnets can perform
helpful tasks, people use them for malicious work, and they are a source of many
malicious activities. Botnet models include centralized, client-server,
decentralized and peer-to-peer, and botnet attacks include DDoS, phishing,
cryptojacking, snooping, bricking, brute force and spambots. Common botnet actions
are email spam, financial breaches and targeted intrusions. A bot herder controls
a collective of hijacked devices using remote commands. Once your machine is
infected it becomes a bot, and you may not even know. Botnets lead to financial
theft, information theft, sabotage of services, and the sale of access to other
criminals. A botnet has three main components, one of which is the bots themselves.
Botnet attacks have increased in recent years, and at the same time the number of
botnet detection frameworks has also increased.
The hacker can access a device only when his application is on the device. Once
his application starts running on the device, he can steal, change or destroy
information. The hacker can also steal money, usernames and passwords, change your
confidential data, and install and run any application he wants on your system.
Any device connected to the internet can be hacked. The most targeted devices are
desktops and laptops that run Windows or macOS. Mobile phones are the next most
targeted devices, as more people use them while connected to the internet. In
recent years the number of devices connected to the internet has increased
rapidly, and botnets built from connected devices have become more prominent.
First the hacker infects your device with malware, for example by sending
download links to the target device. One example is a Trojan horse
(Happy New Year! Click here to see magic). If the owner of the device does not know
that the download link belongs to an attacker and clicks on it, the hacker's
application is downloaded to the device and waits for commands from the main
system (the hacker's system). The hacker can then access everything on the device.
To avoid being attacked, the owner would need to recognise every malware link.
To stay away from malware links, the device should be able to detect them,
prevent the initial infection, or identify an existing infection. Botnet attacks
are hard to detect, and preventing them is more difficult. Yet we can still take
certain measures to prevent botnet attacks.
1.1 OVERVIEW
Cyberattacks are on the rise these days, and many systems are being infected. In
the past, signature-based detection was used to counter these attacks. However, as
technology developed, attacks became more sophisticated, so we use k-means and
decision trees to determine how many hosts are bots and how many are not. If there
is an attack, we report the number of bots detected.
1.2 OBJECTIVE
A botnet is a collection of bots, agents in compromised hosts, controlled by
botmasters via command and control (C2) channels. A malevolent adversary controls
the bots through the botmaster, which could be distributed across several agents that
reside within or outside the network. Hence, bots can be used for tasks ranging from
distributed denial-of-service (DDoS), to massive-scale spamming, to fraud and
identity theft. While bots thrive for different sinister purposes, they exhibit a similar
behavioral pattern when studied up close. The intrusion kill-chain dictates the
general phases a malicious agent goes through in order to reach and infest its target.
1.3 SCOPE
For this phase in BotChase, we evaluate four supervised learning (SL) techniques,
namely decision tree (DT), logistic regression (LR), support vector machine (SVM)
and feed-forward neural network (FNN). We use DT with the Gini split rule, LR
without regularization, and SVM with the Gaussian kernel and a soft margin penalty
of 1. Moreover, the FNN is configured to use cross entropy as an error function and
10 hidden layers of 1000 units each. The DT classifier shows the best performance
with the small dataset, as depicted in Table IV. It successfully detects all bots
in the test dataset, with only a single FP out of the 366871 benign hosts. In
contrast, all other classifiers are lackluster and unable to recall even a single
bot from the dataset. We believe this is because all classifiers, except DT, rely
on gradient descent for error correction. This implies that every single node in
the dataset will affect the end-hypothesis function. Thus, with a dataset that is
unbalanced, the hypothesis will be biased towards the benign hosts, which is the
case for LR, SVM and FNN.
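To make the Gini split rule mentioned above concrete, the following is a minimal sketch in plain Python, not the BotChase implementation. The feature name out_degree, the toy values and the helper names are assumptions chosen for illustration. It shows how a single threshold chosen by Gini impurity can isolate a small number of bots even when benign hosts dominate the data.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, gini_impurity(labels))
    # Try every distinct value except the largest as a candidate threshold.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical, unbalanced toy data: 2 bots among 10 hosts.
out_degree = [1, 2, 2, 3, 1, 2, 3, 2, 40, 55]
is_bot = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

threshold, score = best_threshold(out_degree, is_bot)
print("split at out_degree <=", threshold, "weighted Gini:", score)
```

Because the split is chosen by class purity on each side rather than by a loss summed over every host, the two bots end up alone on one side of the threshold in this toy example.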
CHAPTER 2
LITERATURE SURVEY
network security problem. The traditional security solution becomes inefficient in the
new situation. Therefore, it is an important task for the security industry to seek
technical progress and improve its detection and protection ability. Botnets have
been one of the most important issues in many network security problems,
especially in the last one or two years, and China has become one of the countries
most endangered by botnets; the huge impact of botnets in the world has drawn
people's attention back to the detection problem.
This paper, based on the topic of botnet detection, focuses on the latest research
achievements of botnet detection based on machine learning technology. Firstly, it
expounds the application process of machine learning technology in the research of
network space security, introduces the structure characteristics of botnet, and then
introduces the machine learning in botnet detection. The security features of these
solutions and the commonly used machine learning algorithms are emphatically
analyzed and summarized. Finally, it summarizes the existing problems in the
existing solutions, and the future development direction and challenges of machine
learning technology in the research of network space security.
2.7 Botnet and P2P Botnet Detection Strategies: A Review
AUTHORS: Jitender Kumar , Himanshi Dhayal
ABSTRACT: Among various network attacks, botnet led attacks are considered as
the most serious threats. A botnet, i.e., the network of compromised computers is
able to perform large scale illegal activities such as Distributed Denial of Service
attacks, click fraud, bitcoin mining etc. These attacks are considered as the major
concern now-a-days. In this paper, we present a comprehensive review of botnets,
their lifecycle and types. We also discuss the peer-to-peer botnet detection
techniques' behaviors using various latest detection techniques.
2.8 Botnet Detection Using Recurrent Variational Autoencoder
AUTHORS: Jeeyung Kim, Alex Sim, Jinoh Kim, Kesheng Wu
ABSTRACT: Botnet detection is an active research topic as botnets are a source of
many malicious activities, including distributed denial-of-service (DDoS), click-fraud,
spamming, and crypto-mining attacks. However, it is getting more complicated to
identify botnets due to the continuous evolution of botnet software and families that
harness new types of devices and attack vectors. Recent studies employing machine
learning (ML) showed improved performance to detect botnets to some extent, but
they are still limited and ineffective with the lack of sequential pattern analysis, which
is a key to detect various classes of botnets. In this paper, we propose a novel botnet
detection method, built upon Recurrent Variational Autoencoder (RVAE), that
effectively captures sequential characteristics of botnet anomalies. We validate the
feasibility of the proposed method with the CTU-13 dataset that have been widely
employed for botnet detection studies, and show that our method is at least
comparable to existing techniques in terms of detection accuracy. In addition, our
experimental results show that the proposed method can detect previously unseen
botnets by utilizing sequential patterns of network traffic. We will also show how our
method can detect botnets in the streaming mode, which is the essential requirement
to perform real-time, on-line detection.
2.9 Sonification of Network Traffic for Detecting and Learning About Botnet Behavior
AUTHORS: Mohamed Debashi, Paul Vickers
ABSTRACT: Today's computer networks are under increasing threat from malicious
activity. Botnets (networks of remotely controlled computers, or “bots”) operate in
such a way that their activity superficially resembles normal network traffic which
makes their behavior hard to detect by current intrusion detection systems (IDS).
Therefore, new monitoring techniques are needed to enable network operators to
detect botnet activity quickly and in real time. Here, we show a sonification technique
using the SoNSTAR system that maps characteristics of network traffic to a real-time
soundscape enabling an operator to hear and detect botnet activity. A case study
demonstrated how using traffic log files alongside the interactive SoNSTAR system
enabled the identification of new traffic patterns characteristic of botnet behavior and
subsequently the effective targeting and real-time detection of botnet activity by a
human operator. An experiment using the 11.39 GiB ISOT botnet data set, containing
labeled botnet traffic data, compared the SoNSTAR system with three leading
machine learning-based traffic classifiers in a botnet activity detection test. SoNSTAR
demonstrated greater accuracy (99.92%), precision (97.1%), and recall (99.5%) and
much lower false positive rates (0.007%) than the other techniques. The knowledge
generated about characteristic botnet behaviors could be used in the development of
future IDSs.
2.10 Analysis of Machine Learning Algorithms for IoT Botnet
AUTHOR: Umang Garg, Vaibhav Kaushik, Anushka Panwar, Neha Gupta
ABSTRACT: The Internet of Things (IoT) gains a lot of popularity day-by-day due to
their everlasting availability and ease. As the popularity of IoT increases, it also
attracts hackers which try to take advantage of the vulnerability of IoT devices. An
Intrusion Detection System (IDS) is an intelligence-based system that can investigate
or detect the intrusion in the IoT botnet and check the state of software and hardware
executing in the network. Once the intrusion is detected, it may generate an alarm to
alert the administrator or send some alert message to the owner. In the last decade,
there are several IDSs available which can detect the intrusion. But the major
problems with the existing IDSs like accuracy rate, generation of the false alarm, and
fewer chances of detection of unknown attacks. To deal with the above problems,
some machine learning techniques have been involved by researchers. These
techniques can differentiate between the normal and abnormal behavior of the user's
data or network traffic with high accuracy. In this paper, we summarize and classify
the machine learning algorithms that can be used in IDS with their metrics,
parameters. Then, a case study is implemented with the UNSW-NB15 dataset that
has realistic network traffic with frequently used machine learning techniques. After
that, a comparison will be done and displayed by using an accuracy percentage table
and a bar chart. Finally, some challenges and future scope of the machine learning
techniques in the improvement of IDS will be discussed.
2.11 An enhancing framework for botnet detection using generative adversarial
networks
AUTHORS: Chuanlong Yin, Yuefei Zhu, Shengli Liu, Jinlong Fei, Hetong Zhang
ABSTRACT: The botnet, as one of the most formidable threats to cyber security, is
often used to launch large-scale attack sabotage. How to accurately identify the
botnet, especially to improve the performance of the detection model, is a key
technical issue. In this paper, we propose a framework based on generative
adversarial networks to augment botnet detection models (Bot-GAN). Moreover, we
explore the performance of the proposed framework based on flows. The
experimental results show that Bot-GAN is suitable for augmenting the original
detection model. Compared with the original detection model, the proposed approach
improves the detection performance, and decreases the false positive rate, which
provides an effective method for improving the detection performance. In addition, it
also retains the primary characteristics of the original detection model, which does
not care about the network payload information, and has the ability to detect novel
botnets and others using encryption or proprietary protocols.
2.12 A Survey on Botnet Detection Techniques
AUTHOR: Shivani Gaonkar, Nandini Fal Dessai, Jenny Costa
ABSTRACT: Due to the increased rate of internet usage, security problems have also
increased. One of the serious threats in network security are Botnets. A Botnet is
defined as a collection of various bots that Botmaster controls through the Command
and Control (C&C) channel. During recent times, different technologies and
techniques have been proposed to track the detection of botnets. This paper
summarizes different techniques to detect different botnets. General bot detection
and IoT-bot detection techniques are separately explained. UNSW-NB15 datasets
have been used in training and testing of the proposed model. A real-time IoT-Bot
detection using deep learning algorithm is proposed in this paper. Wireshark is used
to capture a package from network traffic.
2.13 Analysis of Botnet Domain Names for IoT Cybersecurity
AUTHOR: Wanting Li, Jian Jin, Jong-Hyouk Lee
ABSTRACT: Botnets are widespread nowadays with the expansion of the Internet
and commonly occur in many cyber-attacks, resulting in serious threats to network
services and users' properties. With the rapid development of the Internet of Things
(IoT) applications, the botnet can easily make use of IoT devices for larger-scale
attacks. Domain name system (DNS) is widely used by the botnet to establish the
connection between bots and their corresponding command-and-control (C&C). In
order to avoid the track of the C&C through the DNS information, some sophisticated
schemes are used by the botnet and fast-flux is a typical one. In this paper, the
activities of Rustock botnet domain names which just use the fast-flux as the
connection method between bots and C&C, are deeply analyzed from multiple
aspects. Besides, we extract 32 special features of Rustock domain named querying
traffic. Then multiple popular classifiers are adopted in order to pick the malicious
domain names out from the DNS traffic using those 32 features. The work of this
paper aims to provide guidance for future botnet detection based on real statics and
experiments.
AUTHOR: Paul Sroufe, Santi Phithakkitnukoon, Ram Dantu, Joao Cangussu
ABSTRACT: Botnets have become the major sources of spamming, which generates
massive unwanted traffic on networks. An effective detection mechanism can greatly
mitigate the problem. In this paper, we present a novel botnet detection mechanism
based on the email "shape" analysis that relies on neither content nor reputation
analysis. Shape is our new way of characterizing an email by mimicking human visual
inspection. A set of email shapes are derived and then used to generate a botnet
signature. Our preliminary results show greater than 80% classification accuracy
(without considering email content or reputation analysis). This work investigates the
discriminatory power of email shape, for which we believe will be a significant
complement to other existing techniques such as a network behavior analysis.
2.15 Bot Detection via IoT Environment
AUTHOR: Im Y. Jung, Jae J. Jang, Jae-geun Moon
ABSTRACT: Many users do not realize whether their devices become bots or not. There
are many security accidents due to malicious bots. To solve this problem, we propose
a monitor system composed of IoT devices to detect bots.
AUTHOR: Anaël Bonneton, Daniel Migault, Stephane Senecal, Nizar Kheir
ABSTRACT: This paper introduces a behavioral model for botnet detection that
leverages the Domain Name System (DNS) traffic in large Internet Service Provider
(ISP) networks. More particularly, we are interested in botnets that locate
and connect to their command and control servers thanks to Domain Generation
Algorithms (DGAs). We demonstrate that the DNS traffic generated by hosts
belonging to a DGA botnet exhibits discriminative temporal patterns. We show how to
build decision tree classifiers to recognize these patterns in very little computation
time. The main contribution of this paper is to consider whole time series to represent
the temporal behavior of hosts instead of aggregated values computed from the time
series. Our experiments are carried out on real world DNS traffic collected from a
large ISP.
2.19 An analysis of network traffic classification for botnet detection
AUTHOR: Matija Stevanovic, Jens Myrup Pedersen
ABSTRACT: Botnets represent one of the most serious threats to the Internet
security today. This paper explores how network traffic classification can be used for
accurate and efficient identification of botnet network activity at local and enterprise
networks. The paper examines the effectiveness of detecting botnet network traffic
using three methods that target protocols widely considered as the main carriers of
botnet Command and Control (C&C) and attack traffic, i.e. TCP, UDP and DNS. We
propose three traffic classification methods based on capable Random Forests
classifier. The proposed methods have been evaluated through the series of
experiments using traffic traces originating from 40 different bot samples and diverse
non-malicious applications. The evaluation indicates accurate and time-efficient
classification of botnet traffic for all three protocols. The future work will be devoted to
the optimization of traffic analysis and the correlation of findings from the three
analysis methods in order to identify compromised hosts within the network.
2.20 Botnet Domain Name Detection based on machine learning
AUTHOR: Baoping Yan, Guanggang Geng, Zhiwei Yan, Jian Jin
ABSTRACT: Domain Name System (DNS) is a fundamental component of today's
Internet: it provides mappings between domain names used by people and the
corresponding IP addresses required by network protocols. However, the open and
fundamental characteristics of DNS are recently used by the botnet for the
communication between bots and C&C. In this paper, we select six kinds of special
features of botnet domain querying traffic based on the deep studies of the DNS log.
Then three popular classifiers are adopted in order to pick the malicious domains
out from the DNS traffic using those features.
CHAPTER 3
METHODOLOGY
• Do not completely capture the network communication patterns, which can expose
additional aspects of malicious hosts.
Cyberattacks are on the rise these days, and many systems are being infected. In
the past, signature-based detection was used to counter these attacks. However, as
technology developed, attacks became more sophisticated, so we use k-means and
decision trees to determine how many hosts are bots and how many are not. If there
is an attack, we report the number of bots detected.
LIMITATIONS
Botnet detection has been an active area of research that has generated a
substantial body of work. Common botnet detection approaches are passive. They
assume successful intrusions and focus on identifying infected hosts (bots) or
detecting C2 communications, by analyzing system logs and network data, using
signature- or anomaly-based techniques. Signature-based techniques have
commonly been used to detect pre-computed hashes of existing malware in hosts
and/or network traffic. They are also used to isolate IRC-based bots by detecting
bot-like IRC nicknames and to identify C2-related DNS requests by detecting C2-like
domain names. Metadata such as regular expressions based on packet content and
target IP occurrence tuples is an example of what could be employed in a signature
and pattern detection algorithm. More generally, signature-based techniques have
been employed to identify C2 by comparison with known C2 communication patterns
extracted from observed C2 traffic, and infected hosts by comparison with static
profiles and behaviours of known bots.
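The signature-based checks described above can be sketched as follows. This is only an illustration in plain Python: the signature set, the sample payload and the nickname pattern are hypothetical and do not come from any real malware database or from this project.

```python
import hashlib
import re

# Hypothetical signature database: SHA-256 digests of known malware samples.
KNOWN_BAD_HASHES = {hashlib.sha256(b"example-malware-payload").hexdigest()}

# Hypothetical pattern for bot-like IRC nicknames such as "bot-12345":
# a short alphabetic prefix, a separator, then a run of digits.
BOT_NICK_PATTERN = re.compile(r"^[a-z]{2,5}[-_|]\d{4,}$", re.IGNORECASE)

def matches_hash_signature(payload: bytes) -> bool:
    """True if the payload's digest equals a pre-computed malware hash."""
    return hashlib.sha256(payload).hexdigest() in KNOWN_BAD_HASHES

def looks_like_bot_nick(nick: str) -> bool:
    """True if the IRC nickname matches the bot-like pattern."""
    return BOT_NICK_PATTERN.match(nick) is not None

print(matches_hash_signature(b"example-malware-payload"))
print(matches_hash_signature(b"example-malware-payload!"))
print(looks_like_bot_nick("bot-12345"), looks_like_bot_nick("alice"))
```

The second call illustrates the main weakness of exact signatures: changing a single byte of the payload changes the digest, so the modified sample no longer matches.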
The application uses the CTU-13 dataset, obtained from Kaggle. Clicking the Upload
CTU Dataset button opens a file dialog, where we select the dataset and click
Open. After the dataset is uploaded, the screen displays the path from which the
dataset was loaded and the dataset size (total rows and total columns, shown
inside square brackets), along with sample records showing the columns Start Time,
Duration, Protoc, SrcAddrress, Sport, Dire, DstAddress, Dport, State, sTos, dTos,
TotalPackets, TotalBytes, SrcBytes and Label.
Next, k-means is applied to separate bot and benign data in the dataset. The
application reports the dataset size (total rows and columns) before removing the
benign records and again after removing them. In this way k-means separates the
bot data from the benign data.
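The separation step can be illustrated with a small two-cluster k-means written in plain Python. This is a sketch under stated assumptions, not the project's code: the two features (total packets, total bytes), the toy flow values, the deterministic seeding and the choice of which cluster counts as "bot" are all hypothetical.

```python
import math

def kmeans(points, k=2, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical flows as (total packets, total bytes).
flows = [(2, 120), (500, 64000), (3, 180), (2, 150), (480, 61000), (4, 200)]
assignment, centroids = kmeans(flows, k=2)

# Assumption for illustration only: cluster 1 is treated as the bot cluster,
# and the records of the other cluster are removed.
bot_flows = [f for f, a in zip(flows, assignment) if a == 1]
print("before:", len(flows), "after removing benign:", len(bot_flows))
```

In practice the features would usually be scaled first, since k-means relies on Euclidean distance and a large-valued column such as total bytes would otherwise dominate.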
During graph transformation, the command prompt (CMD) shows the generated bot graph
points. The UI shows the number of nodes, the number of edges, the number of
graphs created, and the betweenness centrality for every IP address; here each IP
address is a node of the graph. It also reports the execution time, the clustering
calculation time and the alpha centrality calculation time.
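The idea of turning flows into a host graph can be sketched in plain Python as below. The flow tuples, IP addresses and feature names are hypothetical, and only degree-based features are computed; betweenness and alpha centrality, which the application also reports, are left out of this sketch.

```python
from collections import defaultdict

def build_graph(flows):
    """Aggregate (src, dst, packets) flows into weighted directed edges."""
    edges = defaultdict(int)
    for src, dst, packets in flows:
        edges[(src, dst)] += packets
    return dict(edges)

def node_features(edges):
    """Per-node out-degree, in-degree and their packet-weighted variants."""
    feats = defaultdict(lambda: {"outdegree": 0, "indegree": 0,
                                 "out_degree_weight": 0, "in_degree_weight": 0})
    for (src, dst), weight in edges.items():
        feats[src]["outdegree"] += 1
        feats[src]["out_degree_weight"] += weight
        feats[dst]["indegree"] += 1
        feats[dst]["in_degree_weight"] += weight
    return dict(feats)

# Hypothetical flows: (source IP, destination IP, packets).
flows = [
    ("10.0.0.5", "10.0.0.1", 10),
    ("10.0.0.5", "10.0.0.2", 4),
    ("10.0.0.5", "10.0.0.1", 6),
    ("10.0.0.2", "10.0.0.1", 3),
]
edges = build_graph(flows)
features = node_features(edges)
print("nodes:", len(features), "edges:", len(edges))
```

Repeated flows between the same pair of hosts collapse into one edge whose weight is the summed packet count, which is why four flows produce three edges here.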
After clicking Features Extraction & Normalization, the feature normalization
process completes and some sample records are displayed with the columns
out-degree-weight, in-degree-wt, outdegree, indegree, bot, bc, lcc and ac, all
with normalized values. The normalized and transformed data is saved in the
normalize_data.csv file. The CMD also shows that the features normalization module
is 100 percent done, together with the records.
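The report does not state which normalization is applied, so the sketch below assumes min-max scaling to the range [0, 1], one common choice. The rows and column names are hypothetical.

```python
def min_max_normalize(rows, columns):
    """Scale each named column to [0, 1]; constant columns map to 0.0."""
    scaled = [dict(r) for r in rows]  # copy so the input rows stay unchanged
    for col in columns:
        values = [r[col] for r in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for r in scaled:
            r[col] = (r[col] - lo) / span if span else 0.0
    return scaled

# Hypothetical per-host graph features.
rows = [
    {"outdegree": 2, "indegree": 0},
    {"outdegree": 1, "indegree": 1},
    {"outdegree": 0, "indegree": 2},
]
normalized = min_max_normalize(rows, ["outdegree", "indegree"])
for r in normalized:
    print(r)
```

Scaling every feature to the same range keeps a large-valued feature, such as a packet-weighted degree, from dominating the others in later steps.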
Running the decision tree loads the normalized data into the decision tree
classifier and displays the total dataset size used to build the model, the model
training and testing record counts, the Decision Tree Accuracy, Precision and
Recall, and the True-Pos, False-Pos, True-Neg and False-Neg counts. We train on
80% of the data and test on the remaining 20%. The accuracy of this model is 99%.
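The reported metrics follow from the four confusion-matrix counts. The sketch below shows an 80/20 split and the metric formulas in plain Python; the labels and predictions are made-up toy values, not results from this project, and the helper names are assumptions.

```python
def train_test_split(records, test_fraction=0.2):
    """Keep the first 80% for training and the last 20% for testing."""
    cut = len(records) - round(len(records) * test_fraction)
    return records[:cut], records[cut:]

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) with 'positive' as the bot class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from the confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

train, test = train_test_split(list(range(10)))

# Hypothetical true labels and predictions (1 = bot, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy, precision, recall = metrics(tp, fp, tn, fn)
print(tp, fp, tn, fn, accuracy, precision, recall)
```

On unbalanced data, accuracy alone can look high even when few bots are found, which is why precision and recall are reported alongside it.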
the data and tested the data. Once the algorithm is well trained, it is tested using
new data. In unsupervised learning the training phase is large, because
the machine is given only the input and has to figure out the output on its own;
there is no supervisor or mentor.
3.4.1 K-means
Clustering is something most of us do in daily life, for example when a group of
people share a table. Clustering is the process of dividing a dataset into groups
of similar data points; k-means clustering is one example. Exclusive clustering is
hard clustering, where each point or item belongs to only one cluster.
In this project we use the k-means and decision tree algorithms.
To execute the project, we click Run. The CMD opens and shows the path where the
project is located. The user interface then opens, split into two panels: one
contains the buttons and the other shows the output of the executed functions.
A. Upload CTU Dataset
B. Apply KMEANS to separate Bot & Benign Data
C. Run Flow Ingestion & Graph Transformation
D. Features Extraction & Normalization
E. Run Decision Tree Algorithm
F. View Graph
G. Exit
First open the application and run it from the command prompt; the User Interface
(UI) is displayed. The UI has the buttons Upload CTU Dataset, Apply KMEANS to
separate Bot & Benign Data, Run Flow Ingestion & Graph Transformation, Features
Extraction & Normalization, Run Decision Tree Algorithm, View Graph and Exit.
Click the first button, Upload CTU Dataset, and the available datasets are
displayed. Select one of them and click Open. The UI shows the dataset size (total
rows and total columns) and some sample records. Then click the second button,
Apply KMEANS to separate Bot & Benign Data. It applies the k-means algorithm to
the dataset, separates it into two clusters, bot and benign, removes the benign
records, and reports the dataset size before and after the removal. Then click the
third button, Run Flow Ingestion & Graph Transformation, which reports the number
of nodes, the number of edges and the betweenness centrality. Then click the
fourth button, Features Extraction & Normalization, which completes the feature
normalization process and displays some sample records. Then click Run Decision
Tree Algorithm, which displays the Decision Tree Accuracy, Precision and Recall
along with the True Positive, False Positive, True Negative and False Negative
counts. Next, select the number of nodes to draw and click View Graph to display
the graph, which shows the cluster. Finally, click Exit to close the interface. To
detect botnet attacks in another dataset, upload the new dataset with the Upload
CTU Dataset button and repeat the steps: apply k-means, run flow ingestion and
graph transformation, run feature extraction and normalization, and run the
decision tree algorithm.
3.5 Advantage
HARDWARE REQUIREMENTS
Minimum hardware requirements are very dependent on the particular software being
developed by a given Enthought Python / Canopy / VS Code user. Applications that
need to store large arrays/objects in memory will require more RAM, whereas
applications that need to perform numerous calculations or tasks more quickly will
require a faster processor.
3.7 SYSTEM STUDY
FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is
put forth with a very general plan for the project and some cost estimates. During
system analysis the feasibility study of the proposed system is carried out. This
ensures that the proposed system is not a burden to the company. For feasibility
analysis, some understanding of the major requirements for the system is essential.
3.7.1 Economical Feasibility
This study is carried out to check the economic impact that the system will have on
the organization. The amount of funds that the company can pour into the research
and development of the system is limited, so the expenditures must be justified. The
developed system is well within the budget, which was achieved because most
of the technologies used are freely available. Only the customized products had to be
purchased.
3.7.2 Technical Feasibility
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on
the available technical resources, as this would lead to high demands being placed
on the client. The developed system must have modest requirements, as only minimal
or no changes are required to implement it.
3.7.3 Social Feasibility
This aspect of the study checks the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently. The
user must not feel threatened by the system, but must instead accept it as a necessity.
The level of acceptance by the users depends solely on the methods that are
employed to educate users about the system and to make them familiar with it. Their
level of confidence must be raised so that they are also able to offer constructive
criticism, which is welcomed, as they are the final users of the system.
Fig 1. SYSTEM ARCHITECTURE
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can
be used to represent a system in terms of the input data to the system, the various
processing carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used
to model the system components: the system processes, the data used by the
processes, the external entities that interact with the system, and the
information flows in the system.
3. The DFD shows how information moves through the system and how it is modified
by a series of transformations. It is a graphical technique that depicts information flow
and the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. It may be
partitioned into levels that represent increasing information flow and functional
detail.
[Figure: data flow diagram. Nodes: User; Check (Yes / No: Unauthorized user);
Preprocess Dataset; Model Generation; DBSCAN Clustering; KMean Clustering;
Visualize Clusters; End process]
3.8.1 Introduction To UML
The Unified Modeling Language (UML) is a standard language for specifying,
visualizing, constructing, and documenting the artifacts of software systems, as well
as for business modeling and other non-software systems. The UML represents a
collection of best engineering practices that have proven successful in the modeling
of large and complex systems. The UML is a very important part of developing
object-oriented software and of the software development process. The UML uses
mostly graphical notations to express the design of software projects. Using the UML
helps project teams communicate, explore potential designs, and validate the
architectural design of the software.
3.9.1 Use case diagram:
The use case diagram is used to identify the primary elements and processes that
form the system. The primary elements are termed "actors" and the processes are
called "use cases." The use case diagram shows which actors interact with each use
case.
3.9.2 Class diagram:
The class diagram is used to refine the use case diagram and define a detailed
design of the system. The class diagram classifies the actors defined in the use case
diagram into a set of interrelated classes. The relationship or association between the
classes can be either an "is-a" or "has-a" relationship. Each class in the class
diagram may be capable of providing certain functionalities. These functionalities
provided by the class are termed "methods" of the class. Apart from this, each class
may have certain "attributes" that uniquely identify the class.
3.9.3 Object diagram:
The object diagram is a special kind of class diagram. An object is an instance of a
class. This essentially means that an object represents the state of a class at a given
point of time while the system is running. The object diagram captures the state of
different classes in the system and their relationships or associations at a given point
of time.
FIG 6: STATE DIAGRAM
3.9.5 Activity diagram:
The process flows in the system are captured in the activity diagram. Similar to a
state diagram, an activity diagram also consists of activities, actions, transitions, initial
and final states, and guard conditions.
3.9.6 Sequence diagram:
A sequence diagram represents the interaction between different objects in the
system. The important aspect of a sequence diagram is that it is time-ordered. This
means that the exact sequence of the interactions between the objects is represented
step by step. Different objects in the sequence diagram interact with each other by
passing "messages".
FIG 9: COLLABORATION DIAGRAM
3.9.9 Deployment diagram:
The deployment diagram captures the configuration of the runtime elements of the
application. This diagram is most useful when a system has been built and is ready
to be deployed.
3.10 Modules:
Upload CTU Dataset
Apply KMEANS to separate Bot & Benign Data
Run Flow Ingestion & Graph Transformation
Features Extraction & Normalization
Run Decision Tree Algorithm
Exit
C. Run Flow Integration and Graph Transformation
After clicking on Run Flow Integration, two extra windows open, which need to be
closed. The command prompt (CMD) then shows the generated bot graph points,
and the UI shows the number of nodes, the number of edges, the number of graphs
created, the betweenness centrality for every IP address (here each IP address is a
node), the execution time, the clustering calculation time, and the alpha centrality
calculation time.
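The graph transformation step described above can be illustrated with a minimal sketch. It assumes that each flow record reduces to a (source IP, destination IP) pair and that repeated flows between the same pair collapse into one directed edge; the function name build_graph and the sample addresses are hypothetical and are not taken from the project code. Centrality measures are not computed here.

```python
from collections import defaultdict


def build_graph(flows):
    """Turn (source IP, destination IP) flow records into a directed graph.

    Returns the set of nodes, an adjacency mapping, and the number of
    distinct directed edges.
    """
    adjacency = defaultdict(set)
    nodes = set()
    for src, dst in flows:
        nodes.update((src, dst))
        adjacency[src].add(dst)  # repeated flows collapse into one edge
    edge_count = sum(len(dsts) for dsts in adjacency.values())
    return nodes, dict(adjacency), edge_count


# Hypothetical flow records; every IP address becomes a node.
flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.2", "10.0.0.3"),
    ("10.0.0.1", "10.0.0.2"),  # repeated flow, same edge
]
nodes, adjacency, edge_count = build_graph(flows)
print("nodes:", len(nodes), "edges:", edge_count)
```

Node and edge counts like these correspond to the figures the UI reports after this module runs.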
F. View Graph
The final module provides an input text box in which a number can be entered.
Clicking on View Graph then generates the graph and opens another window that
shows all the IP addresses and their connections.
G. Exit
Clicking on the Exit button closes the GUI and ends the project session.
3.10.1 Algorithm:
k-Means, Density-Based Spatial Clustering (DBSCAN), Self-Organizing Map (SOM),
Decision Tree (DT), Feed-forward Neural Network (FNN), Logistic Regression (LR)
and Support Vector Machine (SVM).
should be outside the benign cluster, regardless of whether or not they are co-located
in the same cluster. Depending on the number of hosts outside the benign cluster, the
supervised learning (SL) classifiers used in this phase will exhibit different results.
The primary objective in this phase is to maximize recall. Recall measures how
many of the actual bots are detected, i.e., do not go unnoticed. It increases with the
number of true positives (TPs) and decreases with the number of false negatives
(FNs): recall = TP / (TP + FN).
Various SL classifiers can be deployed in this phase to achieve this objective, such
as logistic regression (LR), support vector machine (SVM), feed-forward neural
network (FNN) and decision tree (DT).
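The recall measure described above can be computed as in the following minimal sketch. The label encoding (1 for bot, 0 for benign) and the sample predictions are assumptions made for illustration only; they are not results from this project.

```python
def count_tp_fn(y_true, y_pred, positive=1):
    """Count true positives and false negatives for the positive (bot) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fn


def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN); 0.0 when there are no actual positives."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        return 0.0
    return true_positives / actual_positives


# Hypothetical labels: 1 = bot, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
tp, fn = count_tp_fn(y_true, y_pred)
print(recall(tp, fn))  # 3 of the 4 bots are detected -> 0.75
```

Note that the benign host wrongly flagged as a bot does not affect recall; it would lower precision instead.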
• Logistic Regression (LR) and Support Vector Machine (SVM): LR focuses on
binary classification of its input, based on a sigmoid function. Input features are
coupled with corresponding weights and fed into the function. Once a threshold p is
defined, usually 0.5 for the logistic function, it establishes the differentiator between
positive and negative points. Unlike LR, SVM is a non-probabilistic model for
classification. It is not restricted to linearly separable datasets. There are various
methods of training an SVM, including the well-known gradient-descent algorithm.
• Feed-forward Neural Network (FNN): FNNs are artificial neural networks that do
not contain any cyclic dependencies. For a given feed-forward network with multiple
layers, a feature vector is passed into the input layer, fed to the hidden layer of the
network, and then to its output layer. While the input layer is constrained by the
number of features exposed, the hidden and output layers are not. Every neuron may
rely on a separate activation function that shapes the output. Popular activation
functions for FNNs include identity, sigmoid, ReLU and binary step, among others.
FNNs and the previously mentioned SL techniques are online classifiers. An online
classifier is capable of incremental learning, as the weights associated with the
deployed perceptrons are not static. This makes FNNs an attractive candidate for
production-grade deployment.
• Decision Tree (DT): DTs rely heavily on information entropy (IE) and gain to
construct their conditional routing procedure. Generally, IE states how many bits are
needed to represent certain stochastic information in the dataset. By using a DT,
information gain is maximized from the observed data and the taken path. After
training a DT, newly observed data points can be predicted. However, unlike all the
other classifiers, DTs are not online. That is, optimally retraining a DT must be done
from scratch.
Recall the objective from Phase 1, i.e., minimize hosts outside the
benign cluster (HOB) while maximizing bots outside the benign cluster (BOB). This
results in a minimal training dataset for Phase 2. Also, it is expected that the resultant
training dataset from Phase 1 would be unbalanced, with a bias towards benign
hosts. This may prove problematic for LR, SVM and FNN in achieving high recall
rates.
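The information entropy and gain that the decision tree item above relies on can be sketched as follows. This is an illustrative computation under stated assumptions, not the project's implementation: the single numeric feature, the threshold, and the labels are hypothetical, and practical decision tree libraries may use other split criteria.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, feature_values, threshold):
    """Entropy reduction obtained by splitting on feature <= threshold."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Hypothetical normalized feature with one value per host.
labels = ["benign", "benign", "bot", "bot"]
feature = [0.1, 0.2, 0.8, 0.9]
print(entropy(labels))                         # balanced classes: 1 bit
print(information_gain(labels, feature, 0.5))  # perfect split: gain of 1 bit
```

A tree learner evaluates candidate splits like this one and keeps the split with the highest gain at each node.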
CHAPTER 4
RESULTS AND DISCUSSION, PERFORMANCE ANALYSIS
The aim of this work is to develop a user interface that can detect botnet
records based on a graph. The application detects botnet records in an
internet-connected system by using machine learning algorithms, and also detects
new attacks based on the graph plotted using the k-means algorithm. Because
k-means is an unsupervised learning algorithm, it can flag newly created attacks
by means of the distance formula.
The owner of an internet-connected device can secure their system by using our
user interface.
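The distance formula mentioned above is the core of the k-means assignment step: a point is placed in the cluster whose centroid is nearest. The sketch below assumes Euclidean distance and two already-fitted centroids; the centroid coordinates and the sample point are hypothetical values chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_cluster(point, centroids):
    """Return (index of the nearest centroid, distance to that centroid)."""
    distances = [euclidean(point, c) for c in centroids]
    nearest = min(range(len(centroids)), key=distances.__getitem__)
    return nearest, distances[nearest]


# Hypothetical centroids produced by an earlier k-means fit.
centroids = [(0.0, 0.0), (10.0, 10.0)]
cluster, distance = assign_cluster((3.0, 4.0), centroids)
print(cluster, distance)  # nearest centroid is index 0, at distance 5.0
```

A previously unseen record can be assigned in the same way without retraining, which is how an unsupervised model can place new traffic in the bot or benign group.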
CHAPTER 5
SUMMARY AND CONCLUSIONS
SCREENSHOTS:
Fig 4: K-means
Fig 6: CMD graph build
Fig 8: Run Decision Tree
Fig 9: Graph