ABSTRACT
The Internet of Things (IoT) has diverse applications, and the smart home is one of
them. As all devices of a smart home are accessible through the Internet, providing
security and privacy becomes a challenge. In other words, smart devices are
vulnerable to a variety of attacks.
As a result, there is a need to develop an Intrusion Detection System (IDS) for
smart home applications. There exists a model (an IDS solution) which consists of two
components. The first component (based on a Machine Learning approach) monitors the
IoT network behavior. The second component (based on a Rule-based approach) is
established from the security policies configured by the network administrator. This
model predicts malicious activity accurately and prevents attacks.
1.1 MOTIVATION
Our work focuses on developing a novel model that can predict malicious
behaviour and detect malicious IoT nodes on a network. More specifically, the
model consists of two components. The first component is based on a machine
learning (ML) technique which learns the networking behaviour of the IoT-based
network. The second component is a rule-based approach that is established from
a security policy configured by the network administrator. The combination of
both components creates an adaptive and flexible model, which will allow us to
accurately predict malicious activity and prevent security attacks on such systems.
1.2 PROBLEM DEFINITION
REQUIREMENTS
SOFTWARE REQUIREMENTS:
LANGUAGE : PYTHON
HARDWARE REQUIREMENTS:
HARDWARE : PC
PROCESSOR : 2.4 GHz Intel Core 2 Duo
RAM : 4 GB
CHAPTER 2
LITERATURE SURVEY
Machine learning is a discipline that deals with programming systems so that they
automatically learn and improve with experience. Here, learning implies
recognizing and understanding the input data and taking informed decisions based on
the supplied data. It is very difficult to consider all the decisions based on all possible
inputs.
The input to the learning algorithm is the training data, and the output is some expertise,
which usually takes the form of another algorithm that can perform a task. The input data to a
machine learning system can be numbers, text, audio, video, or multimedia. The
corresponding output can be a floating-point number (for instance, the velocity of a
rocket) or an integer representing a category or class (for example, a pigeon or a
sunflower in image recognition).
A machine learning process consists of a number of typical steps. These steps are:
• Determine the problem you want to solve using machine learning technology.
• Search for and collect training data for your machine learning development process.
• Choose a machine learning model.
• Prepare the collected data for training the machine learning model.
• Test your learning system using test data.
• Validate and enhance the machine learning model. You will need to search for more
training data within this iterative loop.
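The steps above can be sketched end to end with scikit-learn. This is only an illustrative sketch: it assumes a synthetic dataset generated with `make_classification` in place of real collected data, and the choice of a K-Nearest Neighbours model and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Determine the problem (binary classification) and collect data.
# Assumption: synthetic data stands in for a real training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Choose a model.
model = KNeighborsClassifier(n_neighbors=5)

# Prepare the data; here this is only an 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train the model, then test it on the held-out data.
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Validation and enhancement would repeat the steps above with other
# settings or with more training data.
print(f"test accuracy: {accuracy:.2f}")
```

In practice the last step is a loop: the measured accuracy decides whether more data or a different model is needed.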
Supervised machine learning algorithms apply what has been learnt in the past to
new data, using labeled examples to predict future events. Starting from the
analysis of a known training dataset, the learning algorithm produces an inferred function to
make predictions about the output values. After sufficient training, the system is able to
provide targets for any new input. The learning algorithm can also compare its output
with the correct, intended output and find errors in order to modify the model
accordingly.
Supervised learning is where you have input variables (X) and an output variable
(Y) and you use an algorithm to learn the mapping function from the input to the output.
Y = f(X)
The goal is to approximate the mapping function so well that when you have new input
data (X), you can predict the output variable (Y) for that data.
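As a minimal sketch of approximating Y = f(X), the snippet below fits a straight line to noisy samples and then predicts Y for a new input. The true mapping (3X + 2), the noise level and the names `f_hat` and `prediction` are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: the unknown mapping is f(X) = 3X + 2, observed with small noise.
X = np.linspace(0.0, 10.0, 50)
Y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=X.size)

# Learn an approximation of f from the labeled examples (least-squares line).
slope, intercept = np.polyfit(X, Y, deg=1)


def f_hat(x):
    """Learned approximation of the mapping function f."""
    return slope * x + intercept


# Predict the output for a new input that was not in the training data.
prediction = f_hat(12.0)
print(f"f_hat(12) = {prediction:.2f}")
```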
Supervised learning problems can further be classified into regression and classification
problems.
Unsupervised learning is where you only have input data (X) and no
corresponding output variables.
The aim of this type of learning is to model the underlying structure or
distribution of the data in order to learn more about the data.
This is called unsupervised learning because, unlike supervised learning above, there are
no correct answers and there is no teacher. Algorithms are left to their own devices to
discover and present the interesting structure in the data.
Clustering: A clustering problem is where you want to discover the inherent groupings
in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
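The "people that buy X also tend to buy Y" idea can be quantified with the support and confidence of a rule. The sketch below uses a handful of hypothetical shopping baskets invented for illustration; the helper names `support` and `confidence` are assumptions for this example, not part of any library.

```python
# Hypothetical shopping baskets, invented for illustration.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "butter", "jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is usually kept only when both values exceed thresholds chosen by the analyst.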
Machine learning can also be seen as a branch of Artificial Intelligence (AI), since the
ability to change experience into expertise or to detect patterns in complex data is a
mark of human or animal intelligence.
As a field of science, machine learning shares common concepts with other disciplines
such as statistics, information theory, game theory, and optimization.
As a subfield of information technology, its goal is to program machines so that they will learn.
However, the purpose of machine learning is not to build an
automated duplication of intelligent behavior, but to use the power of computers to
complement and supplement human intelligence. For example, machine learning
programs can scan and process huge databases, detecting patterns that are beyond the
scope of human perception.
2.6. CYBER ATTACKS IN IOT
IoT devices are vulnerable to various attacks. The reasons that make the devices insecure
are: limitations in computational power, lack of transport encryption, insecure web
interfaces, lack of authentication/authorisation mechanisms, and of course
heterogeneity, as it makes applying security mechanisms uniformly across IoT devices
extremely challenging. Below we discuss a few of the most popular attacks to which IoT
devices are vulnerable:
Denial of Service (DoS) Attack: During a DoS attack, the devices or resources are no
longer available to legitimate users. When multiple nodes on a network take part in
such an attack, it is called a Distributed Denial of Service (DDoS) attack. The attack affects
network resources, bandwidth, CPU, etc.
There are two general methods of DoS attack: flooding services and crashing
services. Flood attacks occur when the system receives too much traffic for the server to
buffer, causing it to slow down and eventually stop.
Other DoS attacks simply exploit vulnerabilities that cause the target system or
service to crash. In these attacks, input is sent that takes advantage of bugs in the target
and subsequently crashes or severely destabilizes the system, so that it can’t be accessed or
used.
Another type of DoS attack is the Distributed Denial of Service (DDoS) attack. A
DDoS attack occurs when multiple systems orchestrate a synchronized DoS attack on
a single target. The essential difference is that instead of being attacked from one
location, the target is attacked from many different locations at once.
A SYN flood is a form of DoS attack in which an attacker sends a succession of
SYN requests to a target's system in an attempt to consume enough server resources to
make the system unresponsive to legitimate traffic.
Generally, when a client attempts to start a TCP connection to a server, the client and
server exchange a series of messages which normally runs like this:
1. The client requests a connection by sending a SYN (synchronize) message to the server.
2. The server acknowledges this request by sending a SYN-ACK back to the client.
3. The client responds with an ACK, and the connection is established.
A SYN flood attack works by not responding to the server with the expected ACK
code. The malicious client can either simply not send the expected ACK, or spoof the source IP
address in the SYN, causing the server to send the SYN-ACK to a falsified IP address,
which will not send the ACK because it knows that it never sent a SYN.
The server will wait for the acknowledgement for some time, as simple network congestion
could also be the cause of the missing ACK. However, in an attack, the half-open connections
created by the malicious client bind resources on the server and may eventually exceed the
resources available on the server. At that point, the server cannot connect to any clients,
whether legitimate or otherwise. This effectively denies service to legitimate clients.
Some systems may also malfunction or crash when other operating system functions are
starved of resources in this way.
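The notion of half-open connections can be illustrated with a small simulation. The sketch below is not a real packet capture or a production detector: the event format `(source_ip, source_port, flag)`, the sample addresses and the threshold of 50 are all assumptions made for illustration.

```python
def half_open_connections(events):
    """Return the (ip, port) pairs that sent a SYN but never completed with an ACK."""
    pending = set()
    for src_ip, src_port, flag in events:
        connection = (src_ip, src_port)
        if flag == "SYN":
            pending.add(connection)      # handshake started
        elif flag == "ACK":
            pending.discard(connection)  # handshake completed
    return pending


def looks_like_syn_flood(events, threshold):
    """Flag the traffic when the number of half-open connections exceeds `threshold`."""
    return len(half_open_connections(events)) > threshold


# Hypothetical traffic: one legitimate handshake and 100 SYNs that are never acknowledged.
legitimate = [("10.0.0.5", 40000, "SYN"), ("10.0.0.5", 40000, "ACK")]
attack = [("203.0.113.7", 50000 + i, "SYN") for i in range(100)]

print(looks_like_syn_flood(legitimate, threshold=50))
print(looks_like_syn_flood(legitimate + attack, threshold=50))
```

A real server would also expire pending entries after a timeout, which this sketch leaves out for brevity.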
There are two detection methods: signature-based detection and anomaly-based
detection.
Signature-based IDS refers to the detection of attacks by looking for specific patterns,
such as byte sequences in network traffic, or known malicious instruction sequences
used by malware. This terminology originates from anti-virus software, which refers to
these detected patterns as signatures. Although a signature-based IDS can easily detect
known attacks, it is difficult for it to detect new attacks, for which no pattern is available.
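A minimal sketch of signature matching on byte sequences is shown below. The signature names, patterns and payloads are hypothetical examples invented for illustration; real IDS rule sets are far richer than plain substring checks.

```python
# Hypothetical signature database: signature name -> byte pattern.
SIGNATURES = {
    "example-shell-command": b"/bin/sh",
    "example-sql-injection": b"' OR '1'='1",
}


def match_signatures(payload, signatures=SIGNATURES):
    """Return the sorted names of all signatures whose pattern occurs in `payload`."""
    return sorted(
        name for name, pattern in signatures.items() if pattern in payload
    )


# A payload that contains a known pattern, and a variant that avoids it.
known_attack = b"GET /login?user=admin' OR '1'='1 HTTP/1.1"
novel_attack = b"GET /login?user=admin'/**/OR/**/2>1 HTTP/1.1"

print(match_signatures(known_attack))
print(match_signatures(novel_attack))
```

The second payload goes unmatched, which illustrates why a purely signature-based IDS struggles with new attacks.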
Procedure
CHAPTER 3
GOALS
The goals in the modelling of the UML are as follows:
• Provide users with a ready-to-use, expressive visual modeling language so that they
can develop and exchange meaningful models.
• Provide extendibility and specialization mechanisms to extend the core concepts.
• Be independent of particular programming languages and development processes.
• Provide a formal basis for understanding the modeling language.
• Encourage the growth of the OO tools market.
• Support higher-level development concepts such as collaborations, frameworks,
patterns and components.
• Integrate best practices.
3.2 Flowchart
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined
by and created from a use-case analysis. Its purpose is to present a graphical overview of the
functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies among those use cases. The main purpose of a use case diagram is to
show which system functions are performed for which actor. The roles of the actors in the system can be
depicted.
CHAPTER 4
IMPLEMENTATION
The existing and proposed work have been implemented using Python on the Windows operating system.
4.1 Python
Python is a widely used, general-purpose, high-level programming language. It was initially designed by
Guido van Rossum in 1991 and is developed by the Python Software Foundation. It was designed with an
emphasis on code readability, and its syntax allows programmers to express concepts in fewer lines of code.
Python is a programming language that lets us work quickly and integrate systems more
efficiently.
Setting up the environment
The Python community has developed many modules to help programmers implement machine learning.
Python Libraries
Scikit-Learn:
It implements a wide range of machine learning algorithms and makes it easy to plug them into
actual applications. It provides functions for regression, clustering, model
selection, preprocessing, classification and more. Its advantage is the high speed of work.
NumPy:
NumPy is short for Numerical Python. It is a universal and versatile library for both professionals
and beginners. With this tool you can operate on multi-dimensional arrays and matrices with
ease. Functions such as linear algebra operations and numerical conversions are also
available.
Pandas:
Pandas is a well-known, high-performance tool for working with data frames. Using it you can load
data from almost any source, calculate various functions and create new parameters, and build queries on
data using aggregate functions akin to SQL. In addition, there are various matrix transformation
functions, a sliding window method and other methods for obtaining information from data.
Matplotlib:
Matplotlib is a flexible library for creating graphs and visualizations. It is powerful but somewhat
heavyweight.
4.2 Modules
1. Dataset
2. Preprocessing
3. Feature Extraction
4. Model Building
5. Evaluation
4.2.1 Dataset: The dataset includes different attributes and attacks. The dataset is randomly divided
into 80% training data and 20% testing data.
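A random 80/20 split can be sketched with scikit-learn's `train_test_split`. The toy data frame below is hypothetical and its `packet_count` column is an assumption for illustration; only the `label` column name follows the project code, and `random_state` is fixed here just to make the example repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset; the real dataset's attributes are not shown here.
df = pd.DataFrame({
    "packet_count": [10, 200, 15, 180, 12, 220, 9, 210, 14, 190],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = df.drop("label", axis=1)
y = df["label"]

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```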
4.2.2 Preprocessing: To generate the features correctly, the data must be preprocessed.
Preprocessing is performed on both the training data and the test data. This step includes checking
for missing values and performing label encoding, which converts object variables to categorical
variables.
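The two preprocessing operations can be sketched as follows. The column names and values are hypothetical, and encoding every non-numeric column with its own `LabelEncoder` is one reasonable reading of the step, stated here as an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical records; column names are assumptions for illustration.
df = pd.DataFrame({
    "protocol": ["tcp", "udp", "tcp", "icmp"],
    "duration": [1.2, 0.4, 3.1, 0.9],
})

# Check for missing values in every column.
missing_per_column = df.isnull().sum()

# Label-encode every non-numeric column into integer codes.
encoded = df.copy()
encoders = {}
for col in encoded.columns:
    if not pd.api.types.is_numeric_dtype(encoded[col]):
        encoders[col] = LabelEncoder()
        encoded[col] = encoders[col].fit_transform(encoded[col])

print(missing_per_column)
print(encoded)
```

Keeping the fitted encoders allows the same mapping to be applied to the test data and inverted when reporting results.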
4.2.3 Feature Extraction: To extract relevant features and improve algorithm performance,
a tree-based classifier (ExtraTreesClassifier) is used for feature extraction.
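Ranking features with a tree-based classifier can be sketched as below. The data is synthetic and the feature names are hypothetical: by construction only the `informative` column determines the label, so one would expect it to receive the largest importance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n_samples = 300

# Hypothetical features: only "informative" determines the label.
X = pd.DataFrame({
    "informative": rng.normal(size=n_samples),
    "noise_a": rng.normal(size=n_samples),
    "noise_b": rng.normal(size=n_samples),
})
y = (X["informative"] > 0).astype(int)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ holds one non-negative score per feature, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
top_features = importances.nlargest(2).index.tolist()
print(importances.sort_values(ascending=False))
```

The highest-ranked features can then be kept as the input to the classifiers.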
4.2.4 Model Building: The model is built on the training data. The classification
algorithms used for training are Naive Bayes, K-Nearest Neighbour and the Skope-rules method, which is a type of
rule-based learning.
K-Nearest Neighbor
K-Nearest Neighbors (KNN) is one of the most basic yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning domain and finds intense application in pattern recognition, data
mining and intrusion detection.
Approach:
Example:
Consider a dataset having two variables, height (cm) and weight (kg), where each point is classified as
Normal or Underweight.
On the basis of the given data, classify the point below as Normal or Underweight using KNN.
According to the Euclidean formula, the distance between two points (x1, y1) and (x2, y2) in the plane is
$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
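A from-scratch sketch of KNN classification with the Euclidean distance follows. The height/weight points are hypothetical values invented for this example (the original data table is not reproduced here), and k = 3 is an arbitrary choice.

```python
import math
from collections import Counter

# Hypothetical training points: (height in cm, weight in kg) -> class.
training_data = [
    ((170, 50), "Underweight"),
    ((175, 52), "Underweight"),
    ((180, 55), "Underweight"),
    ((165, 45), "Underweight"),
    ((170, 65), "Normal"),
    ((175, 70), "Normal"),
    ((180, 75), "Normal"),
    ((165, 58), "Normal"),
]


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(query, data, k=3):
    """Return the majority class among the k training points nearest to `query`."""
    neighbours = sorted(
        data, key=lambda item: euclidean_distance(query, item[0])
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(knn_classify((172, 53), training_data))
print(knn_classify((172, 68), training_data))
```

Because height and weight are on different scales in real data, features are often normalised before distances are computed.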
Naive Bayes
It is a classification technique based on Bayes’ Theorem with an assumption of independence
among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification
methods in some cases.
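$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$
Where,
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
Under the independence assumption, for a predictor vector X = (x1, ..., xn):
$$P(c \mid X) \propto P(x_1 \mid c) \times P(x_2 \mid c) \times \cdots \times P(x_n \mid c) \times P(c)$$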
Approach:
Compute the posterior probability of each class given the input, and assign the input to the
class with the highest probability.
Example :
Say you have 1000 fruits which could be either ‘Banana’, ‘Orange’ or ‘Other’. These are the 3
possible classes of the Y variable.
We have data for the following X variables, all of which are binary (1 or 0).
Long
Sweet
Yellow
So the objective of the classifier is to predict whether a given fruit is a ‘Banana’,
‘Orange’ or ‘Other’ when only the 3 features (long, sweet and yellow) are known.
Let’s say you are given a fruit that is Long, Sweet and Yellow. Can you predict what
fruit it is?
This is the same as predicting Y when only the X variables in the testing data are
known. Let’s solve it by hand using Naive Bayes.
We have to compute 3 probabilities, that is, the probability of the fruit being a
banana, an orange or other. Whichever fruit type gets the highest probability wins.
All the information needed to calculate these probabilities is present in the above tabulation.
Step 1: Compute the ‘Prior’ probabilities for each class of fruit.
That is, the proportion of each fruit class out of all the fruits in the population. You
can provide the ‘Priors’ from prior information about the population. Otherwise, they can
be computed from the training data.
For this case, let’s compute them from the training data. Out of 1000 records in the training data,
you have 500 Bananas, 300 Oranges and 200 Others. So the respective priors are 0.5, 0.3
and 0.2.
Step 2: Compute the probability of evidence that goes in the denominator. This is
nothing but the product of P of Xs for all X. This is an optional step because the
denominator is the same for all the classes and so will not affect the ranking of the classes.
Step 3: Compute the likelihood of evidence that goes in the numerator.
It is the product of the conditional probabilities of the 3 features. In the formula, it says P(X1
|Y=k). Here X1 is ‘Long’ and k is ‘Banana’. That means the probability that the fruit is
‘Long’ given that it is a Banana. In the above table, you have 500 Bananas. Out of those,
400 are long. So, P(Long | Banana) = 400/500 = 0.8. P(Sweet | Banana) = 0.7 and
P(Yellow | Banana) = 0.9 are obtained in the same way.
So, the overall likelihood of evidence for Banana = 0.8 * 0.7 * 0.9 =
0.504
Step 4: Substitute all the 3 equations into the Naive Bayes formula to get the
probability of Banana.
Similarly, you can compute the probabilities for ‘Orange’ and ‘Other fruit’. The
denominator is the same for all 3 cases, so it’s optional to compute.
Clearly, Banana gets the highest probability, so that will be our predicted class.
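The hand calculation can be reproduced in a few lines of Python. The priors and the Banana likelihoods (0.8, 0.7, 0.9) come from the worked example; the likelihoods for Orange and Other are hypothetical values, assumed here only so that the comparison can be run. The sketch also normalises by the sum of the class scores, which is a common way of obtaining posteriors that add up to 1 and does not change which class wins.

```python
# Priors and Banana likelihoods are taken from the worked example.
# Assumption: the Orange and Other likelihoods are hypothetical values
# invented for this sketch.
priors = {"Banana": 0.5, "Orange": 0.3, "Other": 0.2}
likelihoods = {
    "Banana": {"Long": 0.8, "Sweet": 0.7, "Yellow": 0.9},
    "Orange": {"Long": 0.0, "Sweet": 0.5, "Yellow": 1.0},
    "Other": {"Long": 0.5, "Sweet": 0.75, "Yellow": 0.25},
}


def unnormalised_posterior(fruit, features):
    """Prior times the product of the per-feature likelihoods (the numerator)."""
    score = priors[fruit]
    for feature in features:
        score *= likelihoods[fruit][feature]
    return score


observed = ["Long", "Sweet", "Yellow"]
scores = {fruit: unnormalised_posterior(fruit, observed) for fruit in priors}

# Normalise so that the posteriors sum to 1; the ranking is unchanged.
evidence = sum(scores.values())
posteriors = {fruit: score / evidence for fruit, score in scores.items()}
predicted = max(posteriors, key=posteriors.get)

print(scores)
print(predicted)
```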
# Import packages
import warnings

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings("ignore")

# Read data
df = pd.read_csv(r"F:\PROJECT\dtset.csv")

# Print the dimensions, the first rows and the missing-value counts
print(df.shape)
print(df.head(5))
print(df.isnull().sum())

# Label encoder: convert every object column to integer codes
df_En = df.copy()
for col in df_En.columns:
    if df_En[col].dtypes == 'object':
        le = LabelEncoder()
        df_En[col] = le.fit_transform(df_En[col].astype(str))

y = df_En.label
X = df_En.drop('label', axis=1)

# Split train and test data (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(".............test dt................", y_train, y_test)

# Feature extraction
model = ExtraTreesClassifier()
model.fit(X_train, y_train)
# Use the built-in feature_importances_ attribute of tree-based classifiers
print(model.feature_importances_)
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

# Naive Bayes
nb_model = BernoulliNB()
nb_model.fit(X_train, y_train)
nb_predicted = nb_model.predict(X_test)
nb = accuracy_score(y_test, nb_predicted)
print("naviebayes", nb)

# KNN
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
knn_y_pre = clf.predict(X_test)
knn = accuracy_score(y_test, knn_y_pre)
print("knn", knn)

# Precision, recall, F-score
print(classification_report(y_test, knn_y_pre))

# Compare the accuracy of both models
acc_score = [nb, knn]
col = {'accuracy': acc_score}
models = ['NB', 'knn']
df = pd.DataFrame(data=col, index=models)
print(df)
df.plot(kind='bar')
plt.show()
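# Import libraries
import warnings

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from skrules import SkopeRules

warnings.filterwarnings("ignore")

df = pd.read_csv(r"F:/PROJECT/iot_data.csv")
print(df.head())

# Label encoding
df_En = df.copy()
for col in df_En.columns:
    if df_En[col].dtypes == 'object':
        # Convert object data type to integer category codes
        le = LabelEncoder()
        df_En[col] = le.fit_transform(df_En[col].astype(str))

y1 = df_En.label
X1 = df_En.drop('label', axis=1)
features = list(X1.columns)
target_names = sorted(y1.unique())

X, y = X1, y1
# Assumption: idx iterates over the encoded class labels, so that rules are
# learned for each class in turn (one class against all others).
for idx in target_names:
    model = SkopeRules(max_depth_duplication=2,
                       n_estimators=30,
                       precision_min=0.3,
                       recall_min=0.1,
                       feature_names=features)
    model.fit(X, y == idx)
    rules = model.rules_[0:40]
    print("Rules for class", idx)
    for rule in rules:
        print(rule)
    print()
    print(50 * '-')
    print()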
CHAPTER 5
TESTING
TESTING:-
The purpose of testing is to detect errors. Testing is the process of trying to discover
faults or weaknesses in a work product.
5.2 TESTING TYPES
A software engineering product can be tested in one of two ways:
Black box testing
White box testing
Black box testing is carried out knowing the specific functions that a product has been designed
to perform, and demonstrates whether each function is fully operational.
5.3 TESTING STRATEGIES
Four testing strategies that are often adopted by software development teams include:
Unit Testing
Functional Testing
Performance Test
Validation Testing
Unit testing is a level of software testing in which individual units of a software product are tested.
The purpose is to validate that each unit of the software performs as designed. A unit is the smallest
testable part of any software. It usually has one or a few inputs and usually a single output. In
procedural programming, a unit may be an individual function, procedure, etc. In object-oriented
programming, the smallest unit is a method, which may belong to a base/super
class, abstract class or derived class. (Some treat a module of an application as a
unit. This is to be discouraged as there will probably be many individual units within that
module.) Unit testing frameworks, drivers, stubs, and mock/fake objects are used to assist in unit
testing.
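A unit test for a single function can be sketched with Python's built-in `unittest` framework. The `accuracy` function below is a hypothetical unit written for this example; it is not taken from the project code.

```python
import unittest


def accuracy(y_true, y_pred):
    """Hypothetical unit under test: fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


class AccuracyTest(unittest.TestCase):
    def test_all_correct(self):
        self.assertEqual(accuracy([1, 0, 1], [1, 0, 1]), 1.0)

    def test_half_correct(self):
        self.assertAlmostEqual(accuracy([1, 0, 1, 0], [1, 1, 1, 1]), 0.5)

    def test_length_mismatch_is_rejected(self):
        with self.assertRaises(ValueError):
            accuracy([1], [1, 0])


# Run the test case programmatically and keep the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AccuracyTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method checks one behaviour of the unit in isolation, so a failure points directly at the broken behaviour.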
5.3.2 FUNCTIONAL TESTING
5.3.3 PERFORMANCE TEST
The performance test ensures that the output is produced within the time limits. It measures the time
taken by the system for compiling, responding to users, and handling requests sent to the
system to retrieve results.
5.3.4 VALIDATION TESTING
Software testing and validation is achieved through a series of black box tests
that demonstrate conformity with requirements. A test procedure defines the specific test
cases that will be used to demonstrate conformity with requirements. Both the test
plan and the test procedure are designed to ensure that all functional requirements are
achieved, documentation is accurate and other requirements are met. After each validation test
case has been conducted, one of two possible conditions exists.
CHAPTER 7
CONCLUSION
The results obtained are promising, but there are still many elements,
parameters and attributes that need to be considered in order to develop an adaptive IDS.
Furthermore, the model needs to be extended to cover more attacks. Additionally, we would
like to consider other features for the ML training, such as payload,
ingoing/outgoing ratio, etc. Finally, there should be a focus on generating more rules by
using the rule-based algorithm.