
An Adaptive Intrusion Detection for IoT (Internet of Things)

ABSTRACT

IoT (Internet of Things) has diverse applications, and the smart home is one of them. As
all devices of a smart home are accessible through the Internet, providing security and
privacy becomes a challenge. In other words, the smart devices are vulnerable to a
variety of attacks.
As a result, there is a need to develop an Intrusion Detection System (IDS) for the
smart home application. There exists a model (an IDS solution) which consists of two
components. The first component (based on a Machine Learning approach) monitors the
IoT network behavior. The second component (based on a rule-based approach) is
established from the security policies configured by the network administrator. This
model predicts malicious activity accurately and prevents the attacks.

Keywords: Internet of Things, Intrusion Detection, Machine Learning

1.1 MOTIVATION

Our work focuses on developing a novel model that can predict malicious
behaviour and detect malicious IoT nodes on a network. More specifically, the
model consists of two components. The first component is based on a machine
learning (ML) approach, which learns the networking behaviour of the IoT-based
network. The second component is a rule-based approach, which is established from
a security policy configured by the network administrator. The combination of both
components creates an adaptive and flexible model, which allows us to accurately
predict malicious activity and prevent security attacks on such systems.
1.2 PROBLEM DEFINITION

The aim is to build a model (an Intrusion Detection System) using Machine Learning
algorithms and to evaluate its performance using Accuracy, Precision, Recall
and F1-score.
Additionally, to improve accuracy, a combination of the Machine Learning and
rule-based approaches has been developed.
1.3 SOFTWARE AND HARDWARE REQUIREMENTS

SOFTWARE REQUIREMENTS:

OPERATING SYSTEM : WINDOWS

LANGUAGE : PYTHON

HARDWARE REQUIREMENTS:

HARDWARE : PC
PROCESSOR : 2.4 GHz Intel Core 2 Duo
RAM : 4 GB

1.4 Report Organization


Chapter 2 provides an overview of the literature survey.
Chapter 3 provides detailed information on the UML design.
Chapter 4 gives the details of the implementation with modules.
Chapter 5 deals with testing and results of the proposed system.
Chapter 6 gives the conclusions deduced from the results, followed by references and
the Appendix.

CHAPTER 2

LITERATURE SURVEY

2.1 MACHINE LEARNING

Machine learning is a discipline that deals with programming systems so as to make
them automatically learn and improve with experience. Here, learning implies
recognizing and understanding the input data and taking informed decisions based on
the supplied data. It is very difficult to hand-code decisions for all possible
inputs.

Machine Learning (ML) is automated learning with little or no human intervention.


It involves programming computers so that they can learn from the available data and
inputs.
The ultimate goal of machine learning is to construct algorithms that can
learn from previous data and make predictions on new input data.

The input to the learning algorithm is the training data, and the output is expertise,
which usually takes the form of another algorithm that can perform a task. The input
data to a machine learning system can be numbers, text, audio, video, or multimedia.
The corresponding output of the system can be a float or an integer: for instance, the
velocity of a rocket, or an integer representing a category or a class, for example a
pigeon or a sunflower in image recognition.

2.2 MACHINE LEARNING APPLICATIONS

The developed ML algorithms are mainly used in various applications, such as:


Vision processing
Language processing
Forecasting things like stock market trends, weather
Pattern recognition
Games
Data mining
Expert systems
Robotics

2.3. MACHINE LEARNING PROCESS

A machine learning process consists of a number of typical steps. These steps are:
• Determine the problem you want to solve using machine learning technology.
• Search for and collect training data for your machine learning development process.
• Choose a machine learning model.
• Prepare the collected data for training the machine learning model.
• Test your learning system using test data.
• Validate and enhance the machine learning model. You will need to search for more
training data within this iterative loop.
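
A minimal sketch of these steps end-to-end, using scikit-learn; the synthetic dataset and
the choice of classifier here are illustrative assumptions, not part of the proposed system:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# step 2: collect training data (generated synthetically here)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# step 4: prepare the data by splitting it into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# step 3: choose a machine learning model
model = KNeighborsClassifier()

# step 5: train the model, then test the learning system on held-out data
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
# step 6: validate the results and iterate with more training data if needed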

2.4. MACHINE LEARNING METHODS

Machine learning algorithms are often categorised as supervised or unsupervised.

Supervised machine learning algorithms apply what has been learnt in the past to new
data, using labeled examples to predict future events. Starting from the analysis of a
known training dataset, the learning algorithm infers a function to make predictions
about the output values. After sufficient training, the system is able to provide targets
for any new input. The learning algorithm can also compare its output with the correct,
intended output to detect errors, in order to modify the model accordingly.

Supervised learning is where you have input variables (X) and an output variable
(Y) and you use an algorithm to learn the mapping function from input to output:

Y = f(X)

The goal is to approximate the mapping function so well that when you have new input
data (x) you can predict the output variables (Y) for that data.

Supervised learning problems can further be classified into regression and classification
problems.

Classification: when the output variable is a category, such as “red” or “blue”, or
“disease” and “no disease”, it is a classification problem.
Regression: when the output variable is a real value, such as “dollars” or “weight”, it is
a regression problem.

Unsupervised machine learning: these algorithms are used when the training
information is unclassified, unlabeled data. Unsupervised learning
studies how a system can infer a function to describe a hidden structure from unlabeled
data. The system does not determine the correct output, but it explores the data and can
draw inferences from datasets to describe hidden structures in unlabeled data.

Unsupervised learning is where you only have input data (X) and no corresponding
output variables.

The aim of this type of learning is to model the underlying structure or the
distribution of the data in order to learn more about the data.

This is called unsupervised learning because, unlike the supervised learning above, there
are no correct answers and there is no teacher. Algorithms are left to their own devices
to discover and present the interesting structure in the data.

Unsupervised learning can further be classified into clustering and association
problems.

Clustering: a clustering problem is where you want to discover the inherent groupings
in the data, such as grouping customers by purchasing behavior.
Association: an association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
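
As a small illustration of clustering, the sketch below groups customers by two
purchasing-behaviour features using k-means; the data values are made up for the example:

import numpy as np
from sklearn.cluster import KMeans

# unlabeled data: [orders per month, average order value]
X = np.array([[2, 15], [3, 12], [25, 80], [30, 95], [4, 10], [28, 90]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignments discovered without any labels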

2.5. The Purpose of Machine Learning

Machine learning can also be seen as a branch of AI (Artificial Intelligence), since the
ability to turn experience into expertise or to detect patterns in complex data is a
mark of human or animal intelligence.
As a field of science, machine learning shares common concepts with other disciplines
such as statistics, information theory, game theory, and optimization.
As a subfield of IT, its goal is to program machines so that they will learn.
However, the purpose of ML is not to build an automated duplication of intelligent
behaviour, but to use the power of computers to complement and supplement human
intelligence. For example, machine learning programs can scan and process huge
databases, detecting patterns that are beyond the scope of human perception.
2.6. CYBER ATTACKS IN IOT

IoT devices are vulnerable to various attacks. The reasons that make these devices
insecure are: limitations in computational power, lack of transport encryption, insecure
web interfaces, lack of authentication/authorisation mechanisms, and of course
heterogeneity, as it makes applying security mechanisms uniformly across IoT devices
extremely challenging. Below we discuss a few of the most popular attacks to which IoT
devices are vulnerable:

Denial of Service (DoS) Attack: during a DoS attack, devices or resources are no
longer available to legitimate users. When multiple nodes on a network take part in
such an attack, it is called a Distributed Denial of Service (DDoS) attack. The attack
affects network resources such as bandwidth and CPU.

A Denial-of-Service (DoS) attack is an attack meant to shut down a machine or
network, making it inaccessible to its intended users. DoS attacks accomplish this by
flooding the target with traffic, or by sending it information that triggers a crash. In
both instances, the DoS attack deprives legitimate users (e.g. employees, students, or
others) of the service or resource they expected.

There are two general methods of DoS attacks: flooding services and crashing
services. Flood attacks occur when the server receives more traffic than it can
buffer, causing it to slow down and eventually stop.

Other DoS attacks simply exploit vulnerabilities that cause the target system or
service to crash. In these attacks, input is sent that takes advantage of bugs in the target,
which subsequently crash or severely destabilise the system, so that it cannot be accessed
or used.
Another type of DoS attack is the Distributed Denial of Service (DDoS) attack. A
DDoS attack occurs when multiple systems orchestrate a synchronised DoS attack
against a single target. The essential difference is that, instead of being attacked from one
location, the target is attacked from many different locations at once.

SYN FLOOD ATTACK :

A SYN flood is a form of DoS attack in which an attacker sends a succession of
SYN requests to a target's system in an attempt to consume enough server resources to
make the system unresponsive to legitimate traffic.

Generally, when a client attempts to start a TCP connection to a server, the client and
server exchange a series of messages which normally runs like this: the client sends a
SYN to the server, the server replies with a SYN-ACK, and the client answers with an
ACK, completing the three-way handshake.

A SYN flood attack works by not responding to the server with the expected ACK
code. The malicious client can either simply not send the expected ACK, or spoof the
source IP address in the SYN, causing the server to send the SYN-ACK to a falsified IP
address, which will not send the ACK because it knows that it never sent a SYN.

The server waits for the acknowledgement for some time, as simple network congestion
could also be the cause of the missing ACK. In an attack, however, the half-open
connections created by the client bind resources on the server and may eventually exceed
the resources available on the server. At that point, the server cannot connect to any
clients, whether legitimate or otherwise. This effectively denies service to legitimate
clients. Some systems may also malfunction or crash when other operating system
functions are starved of resources in this way.
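
As a rough sketch of how such half-open connections might be spotted, the example below
counts SYN packets without matching ACKs per source address; the packet-log format and
the threshold are assumptions made for illustration only:

from collections import Counter

packets = [  # (source IP, TCP flag) -- hypothetical captured traffic
    ("10.0.0.5", "SYN"), ("10.0.0.5", "SYN"), ("10.0.0.5", "SYN"),
    ("192.168.1.2", "SYN"), ("192.168.1.2", "ACK"),
]

syn_counts = Counter(ip for ip, flag in packets if flag == "SYN")
ack_counts = Counter(ip for ip, flag in packets if flag == "ACK")

THRESHOLD = 2  # illustrative; a real IDS would tune this against baseline traffic
for ip in syn_counts:
    half_open = syn_counts[ip] - ack_counts[ip]
    if half_open > THRESHOLD:
        print("possible SYN flood from", ip, "-", half_open, "half-open connections")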

2.7. INTRUSION DETECTION SYSTEM (IDS) IN IOT

An intrusion detection system (IDS) is a device or software application that monitors
a network or systems for malicious activity. IDSs are classified into two types:
network-based intrusion detection systems (NIDS) and host-based intrusion detection
systems (HIDS). A system that monitors important operating system files is an example
of an HIDS, while a system that analyses incoming network traffic is an example of a
NIDS.

2.7.1. INTRUSION DETECTION METHODS

There are two detection methods: signature-based and anomaly-based detection.
Signature-based IDS refers to the detection of attacks by looking for specific patterns,
such as byte sequences in network traffic, or known malicious instruction sequences
used by malware. This terminology originates from anti-virus software, which refers to
these detected patterns as signatures. Although signature-based IDS can easily detect
known attacks, it is difficult for it to detect new attacks.

Anomaly-based intrusion detection systems were introduced to detect unknown attacks,
in part due to the rapid creation of new malware. The basic idea is to use a machine
learning approach to create a model of trustworthy behaviour, and then compare new
behaviour against this model.

Procedure

 Establish a BASELINE of features regarding NORMAL network behaviour.
 Monitor the TRAFFIC to identify unusual behaviour.
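
A minimal sketch of this two-step procedure, assuming a single packets-per-second feature
and a simple three-sigma deviation rule (both the feature values and the rule are
illustrative simplifications):

import numpy as np

# baseline: packets per second observed during known-normal operation
normal_pps = np.array([100, 110, 95, 105, 98, 102, 107, 99])
mu, sigma = normal_pps.mean(), normal_pps.std()

def is_anomalous(pps, k=3):
    # flag traffic whose rate deviates more than k standard deviations from baseline
    return abs(pps - mu) > k * sigma

print(is_anomalous(104))  # False: within normal behaviour
print(is_anomalous(900))  # True: unusual spike, possible attack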

CHAPTER 3

DESIGN AND ANALYSIS


3.1 UML (UNIFIED MODELLING LANGUAGE)

UML (Unified Modeling Language) is a standard general-purpose modeling
language in the area of object-oriented programming. The standard is managed, and was
created, by the Object Management Group.
The goal is to provide a common language for creating models of object-oriented
computer software. In its current form, UML comprises two major components: a meta-
model and a notation. In the future, some form of method or process may also be added
to, or associated with, UML.
The UML is a standard language for specifying, constructing and documenting
software systems, and for business modeling and other non-software systems.
The UML represents a collection of engineering practices that have proven successful in
the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and of the
software development process, and is used to express the design of software projects.

GOALS
The goals in the design of the UML are as follows:
 Provide users with a ready-to-use, expressive visual modeling language so that they
can develop and exchange meaningful models.
 Provide extendibility and specialization mechanisms to extend the core concepts.
 Be independent of particular programming languages and development processes.
 Provide a formal basis for understanding the modeling language.
 Encourage the growth of the OO tools market.
 Support higher-level development concepts such as collaborations, frameworks,
patterns and components.
 Integrate best practices.
3.2 Flowchart

A flowchart is a type of diagram that represents a workflow or process. A flowchart can
also be defined as a diagrammatic representation of an algorithm, a step-by-step
approach to solving a task.

3.3 Use Case Diagrams

A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by, and created from, a use-case analysis. Its purpose is to present a
graphical overview of the functionality provided by a system in terms of actors, their
goals (represented as use cases), and any dependencies among the use cases. The main
purpose of a use case diagram is to show which system functions are performed for
which actor; the roles of the actors in the system can also be depicted.

3.4 SEQUENCE DIAGRAM

A sequence diagram in UML is a kind of interaction diagram that shows how processes
operate with one another and in what order. It is a construct of a Message Sequence
Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, or
timing diagrams.

CHAPTER 4

IMPLEMENTATION

4.1 Development Environment

The existing and proposed work has been implemented using Python on the Windows operating system.

4.2 Python

Python is a commonly used, general-purpose, high-level programming language. It was initially designed
by Guido van Rossum in 1991 and is developed by the Python Software Foundation. It was mainly developed
with an emphasis on code readability, and its syntax allows programmers to express concepts in fewer
lines of code.
Python is a programming language that lets us work quickly and integrate systems more efficiently.

Setting up the environment

The Python community has developed many modules to help programmers implement machine learning.

Python Libraries

Scikit-Learn :

It implements a wide range of machine-learning algorithms and makes it comfortable to plug them into
actual applications. You can use a whole slew of functions, such as regression, clustering, model
selection, preprocessing, classification and more. A further advantage is its high speed.

Numpy :

NumPy is short for Numerical Python. It is one of the most universal and versatile libraries, for both
professionals and beginners. Using this tool you can operate on multi-dimensional arrays and matrices
with ease and comfort. Functions such as linear algebra operations and numerical conversions are also
available.

Command: pip install numpy scipy scikit-learn

Pandas:

Pandas is a well-known, high-performance tool for working with data frames. Using it you can load
data from almost any source, calculate various functions, create new parameters, and build queries
over the data using aggregate functions akin to SQL. What is more, there are various matrix
transformation functions, a sliding-window method and other methods for obtaining information from data.

Command: pip install pandas scipy scikit-learn

Matplotlib:

Matplotlib is a flexible library for creating graphs and visualization. It is powerful but somewhat
heavy-weight.

Command: pip install matplotlib scipy scikit-learn

4.3 Modules

1. Dataset

2. Preprocessing

3. Feature Extraction

4. Model Building

5. Evaluation

4.3.1 Dataset: The dataset includes different attributes and attacks. The dataset is randomly divided
into 80% training data and 20% testing data.
4.3.2 Preprocessing: To generate the features correctly, the data must be preprocessed. Preprocessing
is performed on both the training data and the test data. The preprocessing step includes checking
for missing values and performing label encoding, which converts object variables to categorical
variables.
4.3.3 Feature Extraction: To extract relevant features and to improve algorithm performance, a
classifier method is used for feature extraction.
4.3.4 Model building: The model is built on the training data. For training the data, the classification
algorithms used are Naive Bayes, K-Nearest Neighbour, and the Skope-rules method, which is a type of
rule-based learning.
K-Nearest Neighbor
K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning domain and finds intense application in pattern
recognition, data mining and intrusion detection.
Approach:

1. Initialise the value of k.

2. To get the predicted class, iterate from 1 to the total number of training data points:
1. Calculate the distance between the test data and each row of the training data. Here we
use Euclidean distance as our distance metric, since it is the most popular method. Other
metrics, such as Chebyshev or cosine distance, can also be used.
2. Sort the calculated distances in increasing order based on the distance values.
3. Get the top k rows from the sorted array.
4. Get the most repeated class of those rows.
5. Return the predicted class.

Example :

Consider a dataset having two variables, height (cm) and weight (kg), where each point is classified as

Normal or Underweight.

On the basis of the given data, classify whether the point below is Normal or Underweight using k-NN.

Given: 57 kg, 170 cm.

According to the Euclidean formula, the distance between two points (x1, y1) and (x2, y2) in the plane is d = √((x2 − x1)² + (y2 − y1)²).
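
A short sketch of this computation in Python, with a small made-up training set standing in
for the report's data table (the rows below are illustrative assumptions):

import math
from collections import Counter

train = [  # (weight_kg, height_cm, label) -- illustrative rows
    (51, 167, "Underweight"), (62, 182, "Normal"), (69, 176, "Normal"),
    (64, 173, "Normal"), (65, 172, "Normal"), (56, 174, "Underweight"),
    (58, 169, "Normal"),
]

def knn_predict(query, k=3):
    # steps 1-2: compute the Euclidean distance from the query to every training row
    dists = [(math.dist(query, row[:2]), row[2]) for row in train]
    # sort the calculated distances in increasing order
    dists.sort(key=lambda d: d[0])
    # take the top k rows and find the most repeated class among them
    top_k = [label for _, label in dists[:k]]
    # return the predicted class
    return Counter(top_k).most_common(1)[0][0]

print(knn_predict((57, 170)))  # classify the 57 kg, 170 cm query point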

Naive Bayes

It is a classification technique based on Bayes' Theorem, with an assumption of independence
among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.
A Naive Bayes model is very easy to build and particularly useful for very large data sets. Along
with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification
methods.

p(c|x) = p(x|c) p(c) / p(x)

Where,

P(c|X) ∝ p(x1|c) * p(x2|c) * ... * p(xn|c) * p(c)

 P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
 P(c) is the prior probability of the class.
 P(x|c) is the likelihood, which is the probability of the predictor given the class.
 P(x) is the prior probability of the predictor.

Approach:

 Calculate the prior probability for the given class labels.

 Find the likelihood probability of each attribute for each class.

 Put these values in the Bayes formula and calculate the posterior probability.

 See which class has the higher probability; the input is assigned to the class with the higher
probability.

Example :

Say you have 1000 fruits which could be either ‘banana’, ‘orange’ or ‘other’. These are the 3
possible classes of the Y variable.

We have data for the following X variables, all of which are binary (1 or 0).

 Long
 Sweet
 Yellow
 So the objective of the classifier is to predict if a given fruit is a ‘Banana’, an
‘Orange’ or ‘Other’ when only the 3 features (long, sweet and yellow) are known.
 Let’s say you are given a fruit that is Long, Sweet and Yellow: can you predict what
fruit it is?

 This is the same as predicting the Y when only the X variables in the testing data are
known. Let’s solve it by hand using Naive Bayes.

 We have to compute 3 probabilities, that is, the probability of the fruit being a
banana, an orange or other. Whichever fruit type gets the highest probability wins.

 All the information needed to calculate these probabilities is given in the counts used in the steps below.

 Step 1: Compute the ‘Prior’ probabilities for each class of fruit.

 That is, the proportion of each fruit class out of all the fruits in the population. You
can provide the ‘Priors’ from prior information about the population. Otherwise, they can
be computed from the training data.

 For this case, let’s compute from the training data. Out of 1000 records in training data,
you have 500 Bananas, 300 Oranges and 200 Others. So the respective priors are 0.5, 0.3
and 0.2.

 P(Y=Banana) = 500 / 1000 = 0.50

 P(Y=Orange) = 300 / 1000 = 0.30

 P(Y=Other) = 200 / 1000 = 0.20

 Step 2: Compute the probability of evidence that goes in the denominator.

 This is nothing but the product of the P(X) values for all the features. This is an optional
step, because the denominator is the same for all the classes and so does not affect which
class wins.

 P(x1=Long) = 500 / 1000 = 0.50

 P(x2=Sweet) = 650 / 1000 = 0.65

 P(x3=Yellow) = 800 / 1000 = 0.80

 Step 3: Compute the probability of likelihood of the evidences that goes in the

numerator.

 It is the product of the conditional probabilities of the 3 features. The formula says
P(X1 | Y=k); here X1 is ‘Long’ and k is ‘Banana’. That means the probability that the fruit is
‘Long’ given that it is a Banana. Out of the 500 Bananas, 400 are long. So,
P(Long | Banana) = 400/500 = 0.8.

 Here, I have done it for Banana alone.

 Probability of Likelihood for Banana

 P(x1=Long | Y=Banana) = 400 / 500 = 0.80

 P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70

 P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90

 So, the overall probability of Likelihood of evidence for Banana = 0.8 * 0.7 * 0.9 =
0.504

 Step 4: Substitute all the 3 quantities into the Naive Bayes formula to get the
probability of banana.

 Similarly, you can compute the probabilities for ‘Orange’ and ‘Other fruit’. The
denominator is the same for all 3 cases, so it is optional to compute.

 Clearly, Banana gets the highest probability, so that will be our predicted class.
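
The same banana calculation expressed as a short Python sketch, using only the counts from
the worked example above:

# priors from the training data: 500 bananas, 300 oranges, 200 others out of 1000
priors = {"Banana": 0.50, "Orange": 0.30, "Other": 0.20}

# likelihoods for Banana: P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana)
likelihood_banana = (400 / 500) * (350 / 500) * (450 / 500)  # 0.8 * 0.7 * 0.9 = 0.504

# numerator of Bayes' rule; the shared denominator P(x) may be skipped
score_banana = likelihood_banana * priors["Banana"]
print(score_banana)  # 0.252; compute the same score for Orange and Other and compare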

 RULE-BASED LEARNING:

 Rule-based machine learning (RBML) is a term in computer science covering any machine
learning method that identifies, learns, or evolves ‘rules’ to store, manipulate or apply. The
defining characteristic of a rule-based machine learner is the identification and utilisation of a
set of relational rules that collectively represent the knowledge captured by the system. This is
in contrast to other machine learners that commonly identify a singular model that can be
universally applied to any instance in order to make a prediction.
 Rule-based machine learning approaches include learning classifier systems, association rule
learning, artificial immune systems, and any other method that relies on a set of rules, each
covering contextual knowledge.
 While rule-based machine learning is a type of rule-based system, it is distinct from traditional
rule-based systems, which are often hand-crafted, and from other rule-based decision makers. This is
because rule-based machine learning applies some form of learning algorithm to automatically
identify useful rules, rather than requiring a human to apply prior domain knowledge in order to
manually construct rules and curate a rule set.
 Skope rules:
 SkopeRules finds logical rules with high precision and fuses them. Finding good rules is done by
fitting classification trees to sub-samples. A fitted tree defines a set of rules; these rules are
then tested out of bag, and the ones with higher precision are selected and merged. This
produces a real-valued decision function, reflecting for each new sample how many rules have
found it abnormal.
 4.3.5. Evaluation: The classification algorithms are evaluated using accuracy score, precision,
recall and F-score.
 True positives and true negatives are the observations that are correctly predicted. We want to
minimise false positives and false negatives.
 True Positives (TP) are the correctly predicted positive values: the value of the actual class
is yes and the value of the predicted class is also yes. E.g. the actual class indicates
that this passenger survived and the predicted class tells you the same thing.
 True Negatives (TN) are the correctly predicted negative values: the value of the actual
class is no and the value of the predicted class is also no. E.g. the actual class says this
passenger did not survive and the predicted class tells you the same thing.
 False positives and false negatives occur when the actual class contradicts the predicted
class.
 False Positives (FP) – when the actual class is no but the predicted class is yes. E.g. the
actual class says this passenger did not survive, but the predicted class tells you that this
passenger will survive.
 False Negatives (FN) – when the actual class is yes but the predicted class is no. E.g. the
actual class indicates that this passenger survived, but the predicted class tells you that the
passenger will die.
 Accuracy - Accuracy is the most intuitive performance measure; it is simply the ratio of
correctly predicted observations to the total observations. One may think that, if we have high
accuracy, our model is best. Accuracy is a great measure, but only when you have
symmetric datasets where the counts of false positives and false negatives are almost the same.
Therefore, you have to look at other parameters to evaluate the performance of your model.
For our model, we got 0.803, which means our model is approximately 80% accurate.
 Accuracy = (TP + TN) / (TP + FP + FN + TN)
 Precision - Precision is the ratio of correctly predicted positive observations to the total
predicted positive observations. The question that this metric answers is: of all passengers
labeled as survived, how many actually survived? High precision relates to a low false
positive rate. We got 0.788 precision, which is pretty good.
 Precision = TP / (TP + FP)
 Recall (Sensitivity) - Recall is the ratio of correctly predicted positive observations to all the
observations in the actual class (yes). The question recall answers is: of all the passengers that
really survived, how many did we label? We got 0.631, which is good for this model as it is
above 0.5.
 Recall = TP / (TP + FN)
 F1 score - F1 Score is the weighted average of Precision and Recall. Therefore, this score takes
both false positives and false negatives into account. Intuitively it is not as easy to understand
as accuracy, but F1 is usually more helpful than accuracy, especially if you have an uneven
class distribution. Accuracy gives correct results if false positives and false negatives have
roughly the same cost. If the cost of false positives and false negatives is different, it is
better to look at both Precision and Recall. In our case, the F1 score is 0.701.
 F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
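
A small sketch computing these four metrics from raw confusion-matrix counts (the TP/TN/FP/FN
values below are illustrative and are not the project's results):

# illustrative confusion-matrix counts
TP, TN, FP, FN = 80, 60, 20, 14

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (recall * precision) / (recall + precision)

print("accuracy", round(accuracy, 3))
print("precision", round(precision, 3))
print("recall", round(recall, 3))
print("f1 score", round(f1, 3))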

4.4. Pseudo Code

# import packages
import pandas as pd
import numpy as np
import sklearn
import sklearn.preprocessing
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn import naive_bayes
from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score
from pandas import DataFrame
import warnings
warnings.filterwarnings("ignore")

# read data
df = pd.read_csv(r"F:\PROJECT\dtset.csv")
# print dimensions of data
print(df.shape)
# print first five rows
print(df.head(5))

# check for missing values
print(df.isnull().sum())

y = df.label
X = df.drop('label', axis=1)
# split train and test data: 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(".............training dt.............", X_train, X_test)
print(".............test dt................", y_train, y_test)

# label encoding
df_En = df
le = LabelEncoder()
# iterate over all columns, encoding only categorical (object) variables
for col in df.columns.values:
    if df[col].dtypes == 'object':
        # convert object data type to category
        df[col] = df[col].astype('category')
        # use the whole column to form an exhaustive list of levels
        data = df[col]
        le.fit(data.values)
        df_En[col] = le.transform(df[col])

y = df_En.label
X = df_En.drop('label', axis=1)
# split the encoded data again: 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# feature extraction
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X_train, y_train)
# use the built-in feature_importances_ attribute of tree-based classifiers
print(model.feature_importances_)

# plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

# Naive Bayes
from sklearn.naive_bayes import BernoulliNB
nb_model = naive_bayes.BernoulliNB()
nb_model.fit(X_train, y_train)
nb_predicted = nb_model.predict(X_test)  # predict with the fitted Naive Bayes model
nb = accuracy_score(y_test, nb_predicted)
print("naive bayes", nb)

# KNN
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
knn_y_pre = clf.predict(X_test)
knn = accuracy_score(y_test, knn_y_pre)
print("knn", knn)

# precision, recall, f-score
from sklearn.metrics import classification_report
print(classification_report(y_test, knn_y_pre))

# plot bar graph of model accuracies
acc_score = [nb, knn]
col = {'accuracy': acc_score}
models = ['NB', 'knn']
df = DataFrame(data=col, index=models)
print(df)
df.plot(kind='bar')
plt.show()

Skope rules file

# import libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
from skrules import SkopeRules
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv(r"F:/PROJECT/iot_data.csv")
print(df.head())

# label encoding
df_En = df
le = LabelEncoder()
# iterate over all columns, encoding only categorical (object) variables
for col in df.columns.values:
    if df[col].dtypes == 'object':
        # convert object data type to category
        df[col] = df[col].astype('category')
        # use the whole column to form an exhaustive list of levels
        data = df[col]
        le.fit(data.values)
        df_En[col] = le.transform(df[col])

y1 = df_En.label
X1 = df_En.drop('label', axis=1)
features = X1.columns
target_names = y1

model = SkopeRules(max_depth_duplication=2,
                   n_estimators=30,
                   precision_min=0.3,
                   recall_min=0.1,
                   feature_names=features)

# fit a one-vs-rest rule model for each encoded class label and print its top rules
for idx in sorted(df_En.label.unique()):
    X, y = X1, y1
    model.fit(X, y == idx)
    rules = model.rules_[0:40]
    print("Rule number ", idx)
    for rule in rules:
        print(rule)
    print()
    print(50 * '-')
    print()
CHAPTER 5
TESTING

TESTING:

The purpose of testing is to detect errors. Testing is the technique of trying to find
faults or weaknesses in a work product.

5.1 TYPES OF TESTS


Testing is an important component of the software lifecycle. It aids in finding and repairing
uncovered errors so that the software does not pose any problems to the vendor. Since the
system was developed using an object-oriented approach, class testing is done at the unit level and
functional testing is done at the system level.

5.2 TESTING TYPES
A software engineering product can be tested in one of two ways:

 Black box testing
 White box testing

5.2.1 BLACK BOX TESTING

Knowing the particular functions that a product has been designed to perform, tests are
conducted to demonstrate whether every function is fully operational.

5.2.2 WHITE BOX TESTING

Knowing the internal workings of a software product, tests are conducted to verify that the
internal operations perform according to the specification and that all internal elements have
been adequately exercised.

5.3 TESTING STRATEGIES

Four testing strategies that are often adopted by software development teams include:

 Unit Testing
 Functional Testing
 Performance Testing
 Validation Testing

5.3.1 UNIT TESTING

Unit testing is a kind of software testing in which individual elements of a software product are
checked. The aim is to verify that every unit of the software performs as designed. A unit is the
smallest testable part of any software program. It has one or a few inputs and usually a single
output. In structural programming, a unit may be an individual function or procedure. In object-
oriented programming, the smallest unit is a method, which may belong to a base/super class, an
abstract class or a derived class. (Some treat a module of an application as a unit; this is to be
discouraged, as there will likely be many individual units inside that module.) Unit testing
frameworks, drivers, stubs, and mock/fake objects are used to assist in unit testing.

5.3.2 FUNCTIONALTESTING

Functional tests provide systematic demonstrations that the functions tested are
available as specified by the business and technical requirements, the system documentation,
and the user manuals.

Functional testing is centred on the following items:

Valid Input : known classes of valid input must be accepted.

Invalid Input : known classes of invalid input must be rejected.

Functions : specific functions must be exercised.

Output : particular classes of application outputs must be exercised.

Systems/Procedures : interfacing systems or procedures must be invoked.

5.3.3 PERFORMANCE TEST

The performance test ensures that the output is produced within the time limits, covering the time
taken by the system for compiling, responding to users, and handling requests sent to the
system to retrieve results.
5.3.4 VALIDATION TESTING

Software testing and validation is achieved through a series of black-box tests
that demonstrate conformity with requirements. A test plan defines the particular test
cases that will be used to demonstrate conformity with requirements. Both the
plan and the procedure are designed to ensure that all functional requirements are
satisfied, documentation is correct, and other requirements are met. After each validation
test case has been carried out, one of two possible conditions exists: the function or
performance characteristic conforms to the specification and is accepted, or a deviation
from the specification is uncovered and a deficiency list is created.

CHAPTER 6

CONCLUSION

6.1. Work Carried Out


The purpose of this work was to develop an adaptive IDS tailored for IoT
ecosystems. Our proposed model is a real-time, network-based detection system that is
both signature-based and anomaly-based. Following the data collection process, we built a
Machine Learning (ML) model that is the core of the proposed IDS. Specifically, to
train our model, we used supervised ML algorithms. This model makes it possible to
recognise abnormalities in the network traffic, even when unknown attacks are being
deployed on the network for the first time. Additionally, to improve the accuracy of
prediction, we used a rule-based algorithm. This algorithm consists of various rules,
which are used in combination with the outcome that the ML model produces.

6.2. Future Work

The results that we obtained are very promising, but there are still many elements,
parameters and attributes that need to be considered to develop a fully adaptive IDS.
Furthermore, we need to extend the system to cover more attacks. Additionally, we would
like to consider other features for the ML training, such as payload and
ingoing/outgoing ratio. Finally, there should be a focus on generating more rules using
the rule-based algorithm.
