BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
Submitted by
OCTOBER – 2023
St. MARTIN'S ENGINEERING COLLEGE
UGC Autonomous
Affiliated to JNTUH, Approved by AICTE
NBA & NAAC A+ Accredited
Dhulapally, Secunderabad - 500 100
www.smec.ac.in
Certificate
Place:
Date:
St. MARTIN'S ENGINEERING COLLEGE
UGC Autonomous
Affiliated to JNTUH, Approved by AICTE
NBA & NAAC A+ Accredited
Dhulapally, Secunderabad - 500 100
www.smec.ac.in
Declaration
We, the students of the ‘Bachelor of Technology in Information Technology’ programme, session 2020 - 2024, St. Martin’s Engineering College, Dhulapally, Kompally, Secunderabad, hereby declare that the work presented in this Project Work entitled “MACHINE LEARNING MODEL FOR MESSAGE QUEUING TELEMETRY TRANSPORT DATA ANALYTICS” is the outcome of our own bona fide work, is correct to the best of our knowledge, and has been undertaken with due regard to Engineering Ethics. The results embodied in this project report have not been submitted to any other university for the award of any degree.
ACKNOWLEDGEMENT
CONTENTS
ACKNOWLEDGMENT i
LIST OF FIGURES iv
LIST OF TABLES v
ABSTRACT vii
CHAPTER 1: INTRODUCTION 1
3.2 Limitations 9
4.1 Overview 10
10.2 Dataset Description 52
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS AND DEFINITIONS
ABSTRACT
CHAPTER 1
INTRODUCTION
1.1 Overview
When the Internet of Things (IoT) is implemented, physical devices (also known as
IoT nodes) are connected to the internet, enabling them to collect and exchange data
with other nodes in the network without the need for human participation [1].
Message Queuing Telemetry Transport (MQTT) has gained widespread use in a range of applications, such as smart homes [2–4], agricultural IoT [5, 6], and industrial applications. This is mainly due to its ability to communicate over low-bandwidth links, its minimal memory requirements, and its reduced packet loss. Figure 1 depicts the architecture of the MQTT protocol for use in the IoT.
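As an illustration of the publish/subscribe pattern on which MQTT is based, the following minimal sketch uses the third-party paho-mqtt Python client; the broker address, port, and topic name are assumptions made purely for illustration:

import paho.mqtt.client as mqtt   # third-party MQTT client library (assumed to be installed)

def on_message(client, userdata, msg):
    # called whenever the broker delivers a message on a subscribed topic
    print(msg.topic, msg.payload.decode())

client = mqtt.Client()                          # MQTT client (paho-mqtt 1.x style API)
client.on_message = on_message
client.connect("broker.example.com", 1883)      # hypothetical broker host, default MQTT port
client.subscribe("sensors/temperature")         # an IoT node subscribes to a topic
client.publish("sensors/temperature", "22.5")   # another node publishes telemetry to that topic
client.loop_forever()                           # process network traffic and dispatch callbacks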
The IoT and associated technologies have evolved at a rapid rate, with 15 billion linked devices in 2015, a number likely to increase to 38 billion devices by 2025, according to Gartner [7]. The IoT is a network of objects, linked by sensors, actuators, gateways, and cloud services, that delivers a service to users. The MQTT protocol has been integrated into a number of IoT applications. Figure 2 depicts how the MQTT protocol supports IoT applications. Traditional intrusion detection systems (IDSs) are only successful when dealing with data that move slowly or with small volumes of data. They are inefficient when dealing with big data or large networks and are unable to cope with high-speed data transmission. Therefore, technologies capable of dealing with massive volumes of data and identifying any indications of network penetration are crucial. When it comes to big data, data security and privacy are perhaps the most pressing concerns, especially in the context of network attacks.
Distributed denial-of-service (DDoS) attacks are one of the most common types of
cyberattacks. They target servers or networks with the intent of interfering with their
normal operation. Although real-time detection and mitigation of DDoS attacks is
difficult to achieve, a solution would be extremely valuable, since attacks can cause
significant damage.
CHAPTER 2
LITERATURE SURVEY
Machine learning (ML) studies are always being improved through the use of
training data and the exploitation of available information. Some consider ML to be a
component of artificial intelligence. Depending on the information provided, various
types of learning can be undertaken, including supervised learning—for example, the
support vector machine (SVM) algorithm and the k-nearest neighbors (KNN)
algorithm—semi-supervised learning, and unsupervised learning (e.g., clustering
methods). Deep learning (DL) models combined with ML techniques produce
excellent results in cybersecurity systems used for detecting attacks. ML techniques
are used in multiple contexts, such as in healthcare. For example, they are being used
to forecast COVID-19 outbreaks, osteoporosis, and schistosomiasis, among other
health-related problems [8, 9]. Many researchers have employed classification algorithms to detect and mitigate DDoS attacks, with the goal of reducing the number of attacks.
DDoS attacks are simple to carry out because they take advantage of network flaws
and generate requests for software services [10, 11]. DDoS attacks take a long time to identify and neutralize, so an effective detection solution is particularly valuable, since these attacks may cause major harm. There are significant drawbacks to the current methods used to
detect DDoS attacks, such as high processing costs and the inability to handle
enormous quantities of data reaching the server [12]. Using a variety of classification
methods, classification algorithms differentiate DDoS packets from other kinds of
packets [13–15]. To secure the IoT against anomalous adversarial attacks, various security-enhancing solutions have been developed. These approaches are often used to
detect attacks in IoT networks by monitoring IoT node operations, such as the rate at
which data are sent. In this chapter, we present a brief review of the literature to highlight recent advancements in IoT security systems, with a particular emphasis on IDSs that target the MQTT protocol. The authors of [16] provided a process tree-based intrusion detection technique for MQTT protocols based on their previous work. The technique describes network behavior in terms of the hierarchical branches of a tree, which can then be used to detect attacks or aberrant behavior in the network. The detection rate was used to evaluate the model, and four common types of attacks were introduced into the network to assess its performance. However, little consideration was given to newly created adversarial attacks and intrusions.
The study in [17] proposes a fuzzy logic model for intrusion detection that is specifically built to safeguard IoT nodes that use the MQTT protocol from denial-of-service (DoS) attacks. Although fuzzy logic has demonstrated its efficacy in a variety of systems, including sensor device intrusion detection in the IoT [18], its rapidly growing complexity with increasing input dimensions limits its ability to detect attacks on IoT platforms where large amounts of data are transferred on a continuous basis. The extreme gradient boosting (XGBoost) algorithm, gated recurrent units (GRUs), and long short-term memory (LSTM) networks are only a few of the ML methods used in [19, 20] to create a cybersecurity system for the MQTT protocol in the IoT. The MQTT dataset, which contains several forms of attack, including intrusion (illegal entry), DoS, malicious code injection, and man-in-the-middle (MitM) attacks, was used to verify the proposed techniques. The MQTT-IoT-IDS2020 dataset was used to test a range of ML approaches; using these approaches, it was found that a system for detecting MQTT attacks could be designed, and this was later validated by the researchers. An MQTT-enabled IoT cybersecurity system demonstrated the use of an artificial neural network (ANN) approach for intrusion detection [21].
Ujjan et al. [22] presented entropy-based feature selection to identify the important features in network traffic for detecting DoS attacks, employing stacked autoencoder (SAE) and CNN models. However, CPU consumption was significantly higher and processing took significantly longer. The models achieved accuracies of 94% and 93%, respectively. Using LSTM and CNN, Gadze et al. [23] introduced DL models to identify DDoS intrusions targeting a software-defined network’s centralized controller. The accuracy of the models was lower than expected: when the data were split in a 70/30 ratio, the accuracies of LSTM and CNN were 89.63% and 66%, respectively. Moreover, when using an LSTM model to detect intrusions in network traffic, DDoS detection was found to be the most time-consuming of all ten attempts tested. A hybrid ML model, a support vector classifier combined with random forest (SVC-RF), was created by Ahuja et al. [24] and used to distinguish between benign and malicious traffic. The authors extracted features from the original dataset and used them to build a new dataset, the SDN dataset, with innovative features. It was determined that the SVC-RF classifier is capable of accurately categorizing data traffic with an accuracy of 98.8% when using the software-defined networking (SDN) dataset.
Wang et al. [25] revealed that a unique DL model based on an upgraded deep belief
network (DBN) can be used to identify network intrusions more quickly. They
replaced the back-propagation approach in the DBN with a kernel-based extreme learning machine (KELM) that the researchers developed and continue to refine. Their model outperformed other current neural network approaches by a wide margin.
In this study, the researchers examined and tested the accuracy of a number of
different categorization algorithms and techniques. The results reveal that the DBN-
KELM algorithm obtained an accuracy of 93.5%, while the DBN-EGWO-KELM
method achieved an accuracy of 98.60%.
CHAPTER 3
EXISTING TECHNIQUE
Suppose there are two categories, Category A and Category B, and we have a new data point x1: in which of these categories does this data point lie? To solve this type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the diagram below:
Fig. 3.1: KNN on dataset.
How does K-NN work?
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Suppose we have a new data point, and we need to put it in the required category.
Consider the below image:
Firstly, we will choose the number of neighbors; here we choose k = 5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as d = √((x2 − x1)² + (y2 − y1)²).
Fig. 3.3: Measuring of Euclidean distance.
By calculating the Euclidean distance, we obtain the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the image below:
As the majority of the nearest neighbors are from category A, this new data point must belong to category A.
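These steps can be illustrated with a short Python sketch; the coordinates and labels below are made-up values, chosen only so that, as in the example above, three of the five nearest neighbours fall in category A and two in category B:

import numpy as np

points = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.8], [6.0, 6.5], [7.0, 6.0]])  # known points
labels = np.array(['A', 'A', 'A', 'B', 'B'])                                     # their categories
x1 = np.array([2.0, 2.0])   # the new data point
k = 5                       # Step 1: the number of neighbours

# Step 2: Euclidean distance from the new point to every known point
distances = np.sqrt(((points - x1) ** 2).sum(axis=1))

# Step 3: take the K nearest neighbours
nearest = labels[np.argsort(distances)[:k]]

# Steps 4 and 5: count the categories among them and assign the majority category
values, counts = np.unique(nearest, return_counts=True)
print("Predicted category:", values[np.argmax(counts)])   # prints: Predicted category: A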
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
• There is no particular way to determine the best value for "K", so we need to try several values to find the best among them. The most preferred value for K is 5.
• A very low value of K, such as K = 1 or K = 2, can be noisy and make the model sensitive to the effects of outliers.
• Large values of K are generally smoother, but too large a value can blur the distinction between the categories.
3.2 Limitations
CHAPTER 4
PROPOSED METHODOLOGY
4.1 Overview
This research presents a project for building machine learning models for MQTT data analytics; MQTT is a lightweight messaging protocol commonly used in IoT and other applications. The main goal of this work is to develop machine learning models
that can classify MQTT data into different categories or message types. These
categories might represent different types of MQTT messages or events, such as
legitimate messages, denial-of-service (DoS) attacks, malformed messages, brute
force attacks, and floods. Here's an overview of the main components:
• Load MQTT Data: The project begins by loading MQTT data from a CSV file
named "mqttdataset_reduced.csv" into a Pandas DataFrame.
• Data Cleaning: Replace a specific value (0x00000000) with 0 in the dataset.
• Data Encoding: Encode categorical variables using label encoding. This is necessary because most machine learning algorithms require numerical input. (A short sketch of these three steps follows this list.)
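A minimal sketch of these loading, cleaning, and encoding steps, consistent with the full source code in Chapter 9, is:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

dataset = pd.read_csv("mqttdataset_reduced.csv")   # load the MQTT data
dataset.replace(0x00000000, 0, inplace=True)       # data cleaning: replace the value 0x00000000 with 0
le = LabelEncoder()
dataset['mqtt.protoname'] = le.fit_transform(dataset['mqtt.protoname'])   # encode a categorical column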
Step 4: Model Evaluation
• Accuracy Calculation: Calculate and print the accuracy of both the Random
Forest and KNN classifiers on the test data.
• Classification Report: Generate a detailed classification report for both
models, including metrics such as precision, recall, F1-score, and support for
each class.
• Confusion Matrix: Create confusion matrix heatmaps to visualize the model's
performance in terms of true positive, true negative, false positive, and false
negative predictions.
• After training and evaluating the models, the code performs predictions on the
test data for both the Random Forest and KNN classifiers.
• For each prediction, it determines the predicted class label (e.g., 'legitimate',
'dos', 'slowite', etc.) and prints the features of the test data point along with the
predicted class label.
In summary, this project aims to automate the classification of MQTT data into
different message types using machine learning. It includes data preprocessing, model
training and evaluation, and visualizations to assess and understand the models'
performance. Additionally, it provides a way to save and load pre-trained models to
avoid retraining when working with new data.
4.2 Data Preprocessing
Data pre-processing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and a crucial step when creating a machine learning model. When creating a machine learning project, we do not always come across clean and formatted data, and before doing any operation with data, it is necessary to clean it and put it into a structured form. For this we use data pre-processing. Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be used directly by machine learning models. Data pre-processing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.
NumPy: The NumPy library is used for performing mathematical operations in the code. It is the fundamental package for scientific computing in Python. It also supports large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
Here we have used nm as a short alias for NumPy, and it will be used throughout the program.
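For example, a short illustrative use of the library with this alias:

import numpy as nm

a = nm.array([[1, 2], [3, 4]])   # a 2 x 2 matrix
print(nm.mean(a))                # mean of all elements, 2.5
print(a.T)                       # transpose of the matrix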
Matplotlib: The Matplotlib library is used for plotting charts and graphs from the data. Its pyplot module is imported as:
import matplotlib.pyplot as mpt
Pandas: The last library is the Pandas library, which is one of the most popular Python libraries and is used for importing and managing datasets. It is an open-source data manipulation and analysis library. Here, we use pd as a short alias for this library.
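As a minimal sketch, assuming the same mqttdataset_reduced.csv file used elsewhere in this project, Pandas can be imported and used as:

import pandas as pd

data = pd.read_csv("mqttdataset_reduced.csv")   # import the dataset
print(data.head())                              # inspect the first five rows
print(data.shape)                               # number of rows and columns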
Feature Scaling: Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset to a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates another. Models such as KNN are based on Euclidean distance, and if we do not scale the variables, this will cause issues in our machine learning model. The Euclidean distance between two points (x1, y1) and (x2, y2) is given as d = √((x2 − x1)² + (y2 − y1)²). If we compare any two values of age and salary, the salary values will dominate the age values and produce an incorrect result. So, to remove this issue, we need to perform feature scaling for machine learning.
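A minimal sketch of feature scaling with scikit-learn's StandardScaler, assuming x_train and x_test have already been produced by the split described in the next section:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
x_train = sc.fit_transform(x_train)   # fit on the training data and scale it
x_test = sc.transform(x_test)         # scale the test data with the same parameters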
4.3 Splitting the Dataset
In machine learning data preprocessing, we divide our dataset into a training set and a test set. This is one of the crucial steps of data preprocessing, as by doing this we can enhance the performance of our machine learning model. Suppose we train our machine learning model on one dataset and test it on a completely different dataset; the model will then struggle to understand the correlations between them. Likewise, if we train our model very well and its training accuracy is very high, but its performance drops when it is given a new dataset, the model is of little use. So we always try to build a machine learning model that performs well both on the training set and on the test dataset. Here, we can define these datasets as:
Training set: a subset of the dataset used to train the machine learning model, for which we already know the output.
Test set: a subset of the dataset used to test the machine learning model; using the test set, the model predicts the output.
For splitting the dataset, we will use the below lines of code:
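A typical sketch, assuming the feature matrix is stored in x and the labels in y:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)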
Explanation
• In the above code, the first line is used for splitting arrays of the dataset into
random train and test subsets.
• In the second line, we have used four variables for our output, namely:
• x_train: features for the training data
• x_test: features for the testing data
• y_train: labels for the training data
• y_test: labels for the testing data
• The last parameter random_state is used to set a seed for the random generator so that you always get the same result; the most commonly used value for it is 42.
Random Forest algorithm
Step 1: In Random Forest, n random records are taken from a data set having k records.
Step 2: An individual decision tree is constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is based on majority voting (for classification) or averaging (for regression).
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not.
But together, all the trees predict the correct output. Therefore, below are two
assumptions for a better Random forest classifier:
• There should be some actual values in the feature variable of the dataset so
that the classifier can predict accurate results rather than a guessed result.
• The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm: it takes less training time than many comparable algorithms, it predicts output with high accuracy and runs efficiently even for large datasets, and it can maintain accuracy when a large proportion of the data is missing.
Types of Ensembles
Before understanding the working of the random forest, we must look into the
ensemble technique. Ensemble simply means combining multiple models. Thus, a
collection of models is used to make predictions rather than an individual model.
Ensemble uses two types of methods:
Bagging – It creates different training subsets from the sample training data with replacement, and the final output is based on majority voting; Random Forest is an example. Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest. Bagging chooses random samples from the data set: each model is generated from a sample (a bootstrap sample) drawn from the original data with replacement, a step known as row sampling. This step of row sampling with replacement is called the bootstrap. Each model is then trained independently and generates a result. The final output is based on majority voting after combining the results of all models; this step, which involves combining all the results and generating the output based on majority voting, is known as aggregation.
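A minimal sketch of this row sampling with replacement (the bootstrap step) using NumPy, assuming X_train and y_train from the earlier train/test split:

import numpy as np

rng = np.random.default_rng(42)
n_rows = len(X_train)
indices = rng.integers(0, n_rows, size=n_rows)             # draw n_rows row indices with replacement
X_boot, y_boot = X_train.iloc[indices], y_train[indices]   # one bootstrap sample
# Each such sample would be used to train one model; the final prediction is then
# obtained by majority voting (aggregation) over all of the trained models.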
Boosting – It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. Examples include AdaBoost and XGBoost.
Working
The Random Forest Classifier is an ensemble learning method that combines multiple
decision trees to make predictions. Each decision tree in the forest is constructed
based on a random subset of the dataset and a random subset of features. This
randomness helps reduce overfitting and improves generalization. In this project, the
Random Forest Classifier is trained using the preprocessed MQTT dataset. If a pre-
trained model exists in a file ('randomforestmodel.pkl'), it is loaded to avoid
retraining. Otherwise, a new Random Forest model is created and trained on the
training data. After training, it is capable of making predictions. It can classify MQTT
data into different message types based on the patterns it has learned during training.
Here, predictions are made on the test dataset. The accuracy of the model, which represents the proportion of correctly classified samples, is calculated by comparing its predictions to the actual target values in the test dataset. Afterwards, a classification report is generated to provide a more detailed evaluation of the model's performance.
It includes metrics such as precision, recall, F1-score, and support for each class
(message type). This report offers insights into the model's strengths and weaknesses
for different classes. Finally, it visualizes the model’s performance using a confusion
matrix heatmap. This matrix helps assess the model's ability to correctly classify
instances and identify any misclassifications.
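The flow described above corresponds roughly to the following condensed sketch of the source code given in Chapter 9:

import os, joblib
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# X_train, X_test, y_train, y_test are assumed to come from the train/test split of Section 4.3
filename = 'randomforestmodel.pkl'
if os.path.exists(filename):
    rf = joblib.load(filename)          # reuse the pre-trained model if it has been saved before
else:
    rf = RandomForestClassifier()
    rf.fit(X_train, y_train)            # train a new model on the preprocessed MQTT data
    joblib.dump(rf, filename)           # persist it for later runs

y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()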
Key Features
CHAPTER 5
UML DIAGRAMS
GOALS: The Primary goals in the design of the UML are as follows:
5.1 Class Diagram
The class diagram is used to refine the use case diagram and define a detailed design
of the system. The class diagram classifies the actors defined in the use case diagram
into a set of interrelated classes. The relationship or association between the classes
can be either an “is-a” or “has-a” relationship. Each class in the class diagram may be
capable of providing certain functionalities. These functionalities provided by the
class are termed “methods” of the class. Apart from this, each class may have certain
“attributes” that uniquely identify the class.
5.2 Data flow diagram
A Data Flow Diagram (DFD) is a visual representation of the flow of data within a
system or process. It is a structured technique that focuses on how data moves through
different processes and data stores within an organization or a system. DFDs are
commonly used in system analysis and design to understand, document, and
communicate data flow and processing.
5.3 Sequence Diagram
5.4 Activity diagram
CHAPTER 6
SOFTWARE ENVIRONMENT
What is Python?
Python is a high-level, general-purpose, interpreted programming language. One of the biggest strengths of Python is its huge collection of standard libraries, which can be used for the following:
• Machine Learning
• GUI Applications (like Kivy, Tkinter, PyQt etc. )
• Web frameworks like Django (used by YouTube, Instagram, Dropbox)
• Image processing (like Opencv, Pillow)
• Web scraping (like Scrapy, BeautifulSoup, Selenium)
• Test frameworks
• Multimedia
Advantages of Python
1. Extensive Libraries
Python ships with an extensive library containing code for various purposes such as regular expressions, documentation generation, unit testing, web browsers, threading, databases, CGI, email, and image manipulation. So, we do not have to write the complete code for these tasks manually.
2. Extensible
As we have seen earlier, Python can be extended with other languages: you can write some of your code in languages like C++ or C. This comes in handy, especially in performance-critical parts of projects.
3. Embeddable
4. Improved Productivity
5. IOT Opportunities
Since Python forms the basis of new platforms like Raspberry Pi, it finds the future
bright for the Internet Of Things. This is a way to connect the language with the real
world.
6. Simple and Easy
When working with Java, you may have to create a class just to print ‘Hello World’. But in Python, a single print statement will do. Python is also quite easy to learn, understand, and code. This is why, when people pick up Python, they sometimes have a hard time adjusting to other, more verbose languages like Java.
7. Readable
Because it is not such a verbose language, reading Python is much like reading
English. This is the reason why it is so easy to learn, understand, and code. It also
does not need curly braces to define blocks, and indentation is mandatory. This
further aids the readability of the code.
8. Object-Oriented
Python supports both procedural and object-oriented programming paradigms.
9. Free and Open-Source
Like we said earlier, Python is freely available. Not only can you download Python for free, you can also download its source code, make changes to it, and even distribute it. It comes with an extensive collection of libraries to help you with your tasks.
10. Portable
When you code your project in a language like C++, you may need to make some
changes to it if you want to run it on another platform. But it isn’t the same with
Python. Here, you need to code only once, and you can run it anywhere. This is called
Write Once Run Anywhere (WORA). However, you need to be careful enough not to
include any system-dependent features.
11. Interpreted
Lastly, we will say that it is an interpreted language. Since statements are executed
one by one, debugging is easier than in compiled languages.
1. Less Coding
Almost all tasks done in Python require less code than the same tasks done in other languages. Python also has awesome standard library support, so you don't have to search for third-party libraries to get your job done. This is the reason many people suggest that beginners learn Python.
2. Affordable
The 2019 Github annual survey showed us that Python has overtaken Java in the most
popular programming language category.
Python code can run on any machine whether it is Linux, Mac or Windows.
Programmers need to learn different languages for different jobs but with Python, you
can professionally build web apps, perform data analysis and machine learning,
automate things, do web scraping and also build games and powerful visualizations. It
is an all-rounder programming language.
Disadvantages of Python
So far, we’ve seen why Python is a great choice for your project. But if you choose it,
you should be aware of its consequences as well. Let’s now see the downsides of
choosing Python over another language.
1. Speed Limitations
We have seen that Python code is executed line by line. But since Python is
interpreted, it often results in slow execution. This, however, isn’t a problem unless
speed is a focal point for the project. In other words, unless high speed is a
requirement, the benefits offered by Python are enough to distract us from its speed
limitations.
The reason it is not so famous despite the existence of Brython is that it isn’t that
secure.
3. Design Restrictions
As you know, Python is dynamically typed. This means that you don’t need to declare
the type of variable while writing the code. It uses duck-typing. But wait, what’s that?
Well, it just means that if it looks like a duck, it must be a duck. While this is easy on
the programmers during coding, it can raise run-time errors.
4. Underdeveloped Database Access Layers
Compared to more widely used technologies like JDBC (Java DataBase Connectivity) and ODBC (Open DataBase Connectivity), Python's database access layers are a bit underdeveloped. Consequently, it is less often applied in huge enterprises.
5. Simple
No, we’re not kidding. Python’s simplicity can indeed be a problem. Take my
example. I don’t do Java, I’m more of a Python person. To me, its syntax is so simple
that the verbosity of Java code seems unnecessary.
This was all about the Advantages and Disadvantages of Python Programming
Language.
NumPy
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined using NumPy, which allows it to seamlessly and speedily integrate with a wide variety of databases.
Pandas
Matplotlib
Scikit-learn
Install Python Step-by-Step in Windows and Mac
There have been several updates to Python over the years. The question is: how do we install Python? This can be confusing for a beginner who wants to start learning Python, but this section will resolve that query. At the time of writing, the latest version of Python is 3.7.4, in other words, Python 3.
Note: The python version 3.7.4 cannot be used on Windows XP or earlier devices.
Before you start with the installation process of Python, you first need to know your system requirements. Based on your system type, i.e., the operating system and processor, you must download the appropriate Python version. The system used here is a Windows 64-bit operating system, so the steps below install Python version 3.7.4 (Python 3) on a Windows device. The steps on how to install Python on Windows 10, 8, and 7 are divided into four parts to help understand them better.
Step 1: Go to the official site to download and install python using Google Chrome or
any other web browser. OR Click on the following link: https://www.python.org
Step 2: Now, check for the latest and correct version for your operating system.
Step 3: You can either select the Download Python 3.7.4 button or scroll further down and click on the download link for the required version. Here, we are downloading the most recent Python version for Windows, 3.7.4.
Step 4: Scroll down the page until you find the Files option.
Step 5: Here you see a different version of python along with the operating system.
• To download Windows 32-bit python, you can select any one from the three
options: Windows x86 embeddable zip file, Windows x86 executable installer
or Windows x86 web-based installer.
• To download Windows 64-bit python, you can select any one from the three
options: Windows x86-64 embeddable zip file, Windows x86-64 executable
installer or Windows x86-64 web-based installer.
Here we will use the Windows x86-64 web-based installer. This completes the first part, choosing which version of Python to download. Now we move on to the second part of installing Python, i.e., the installation itself.
Note: To know the changes or updates that are made in the version you can click on
the Release Note Option.
Installation of Python
Step 1: Go to Download and Open the downloaded python version to carry out the
installation process.
Step 2: Before you click on Install Now, make sure to tick Add Python 3.7 to PATH.
Step 3: Click on Install Now. After the installation is successful, click on Close.
With these above three steps on python installation, you have successfully and
correctly installed Python. Now is the time to verify the installation.
Step 4: Let us test whether Python is correctly installed. Type python -V and press Enter.
Note: If you have an earlier version of Python already installed, you must first uninstall it and then install the new one.
Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program
Step 4: To go ahead with working in IDLE you must first save the file. Click on File >
Click on Save
Step 5: Name the file and save as type should be Python files. Click on SAVE. Here I
have named the files as Hey World.
Step 6: Now for e.g. enter print (“Hey World”) and Press Enter.
You will see that the command given is launched. With this, we end our tutorial on
how to install Python. You have learned how to download python for windows into
your respective operating system.
Note: Unlike Java, Python does not need semicolons at the end of statements.
CHAPTER 7
SYSTEM REQUIREMENTS
The functional requirements or the overall description documents include the product
perspective and features, operating system and operating environment, graphics
requirements, design constraints and user documentation.
Minimum hardware requirements are very dependent on the particular software being
developed by a given Enthought Python / Canopy / VS Code user. Applications that
need to store large arrays/objects in memory will require more RAM, whereas
applications that need to perform numerous calculations or tasks more quickly will
require a faster processor.
CHAPTER 8
FUNCTIONAL REQUIREMENTS
Output Design
Outputs from computer systems are required primarily to communicate the results of processing to users. They are also used to provide a permanent copy of the results for later consultation. The various types of outputs in general are:
Output Definition
Input Design
Input design is a part of overall system design. The main objective during the input
design is as given below:
• To ensure that the input is acceptable and understood by the user.
Input Stages
• Data recording
• Data transcription
• Data conversion
• Data verification
• Data control
• Data transmission
• Data validation
• Data correction
Input Types
Input Media
At this stage, a choice has to be made about the input media. To decide on the input media, consideration has to be given to:
• Type of input
• Flexibility of format
• Speed
• Accuracy
• Verification methods
• Rejection rates
• Ease of correction
• Storage and handling requirements
• Security
• Easy to use
• Portability
Keeping in view the above description of the input types and input media, it can be said that most of the inputs are internal and interactive. As the input data are to be keyed in directly by the user, the keyboard can be considered the most suitable input device.
Error Avoidance
At this stage, care is to be taken to ensure that the input data remain accurate from the stage at which they are recorded up to the stage at which they are accepted by the system. This can be achieved only by means of careful control each time the data are handled.
Error Detection
Even though every effort is made to avoid the occurrence of errors, a small proportion of errors is still likely to occur; these errors can be discovered by using validations to check the input data.
Data Validation
Procedures are designed to detect errors in data at a lower level of detail. Data
validations have been included in the system in almost every area where there is a
possibility for the user to commit errors. The system will not accept invalid data. Whenever invalid data are keyed in, the system immediately prompts the user, who has to key in the data again; the system accepts the data only if they are correct. Validations have been included where necessary.
The system is designed to be a user friendly one. In other words the system has been
designed to communicate effectively with the user. The system has been designed
with popup menus.
User Interface Design
It is essential to consult the system users and discuss their needs while designing the
user interface:
• User-initiated interfaces: the user is in charge, controlling the progress of the user/computer dialogue.
• Computer-initiated interfaces: the computer selects the next stage in the interaction.
Computer-Initiated Interfaces
• The menu system: the user is presented with a list of alternatives and chooses one of them.
• Question-answer type dialog system: the computer asks a question and takes action on the basis of the user's reply.
Right from the start, the system is menu driven: the opening menu displays the available options. Choosing one option gives another popup menu with more options. In this way, every option leads the user to a data entry form where the data can be keyed in.
Error Message Design
The design of error messages is an important part of the user interface design. As the user is bound to commit some errors while using the system, the system should be designed to be helpful by providing the user with information regarding the error he/she has committed.
This application must be able to produce output at different modules for different
inputs.
Performance Requirements
The requirement specification for any system can be broadly stated as given below:
CHAPTER 9
SOURCE CODE
import pandas as pd
import numpy as np
import os
import joblib
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

dataset = pd.read_csv("mqttdataset_reduced.csv")   # load the MQTT dataset
dataset
dataset.replace(0x00000000, 0, inplace=True)       # data cleaning: replace 0x00000000 with 0
dataset['target'].unique()                         # the distinct message-type labels
dataset.isnull().sum()
dataset.info()
dataset.columns

# Count plot of the target classes
sns.set(style="darkgrid")
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='target', data=dataset)
for p in ax.patches:
    # annotate each bar with its count (assumed completion of the original loop body)
    ax.annotate(f"{int(p.get_height())}", (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom')
plt.xlabel("Categories")
plt.ylabel("Count")
plt.title("Count Plot")
plt.show()

X = dataset.iloc[:, :-1]   # features
X.head()
y = dataset.iloc[:, -1]    # target labels
y
#Label encoding
le=LabelEncoder()
y=le.fit_transform(y)
X['tcp.flags'] = le.fit_transform(X['tcp.flags'])
X['mqtt.conflags'] = le.fit_transform(X['mqtt.conflags'])
X['mqtt.hdrflags'] = le.fit_transform(X['mqtt.hdrflags'])
X['mqtt.msg'] = le.fit_transform(X['mqtt.msg'])
X['mqtt.protoname'] = le.fit_transform(X['mqtt.protoname'])
# Split the data into training (80%) and testing (20%) sets, as described in Chapter 10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Random forest
filename = 'randomforestmodel.pkl'
if os.path.exists(filename):
    rf = joblib.load(filename)      # reuse the saved model if it exists
else:
    rf = RandomForestClassifier()
    rf.fit(X_train, y_train)        # otherwise train a new model
    joblib.dump(rf, filename)       # and save it for later runs

y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

#Confusion_matrix
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax)   # heatmap of the confusion matrix
# Set labels and title
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Random Forest Confusion Matrix')
plt.show()
#KNN
model_filename = 'model.pkl'
if os.path.exists(model_filename):
    knn = joblib.load(model_filename)
else:
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    joblib.dump(knn, model_filename)
y_pred1 = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred1)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred1))

cm = confusion_matrix(y_test, y_pred1)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax)   # heatmap of the KNN confusion matrix
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('KNN Confusion Matrix')
plt.show()
A='legitimate'
B='dos'
C='slowite'
D='malformed'
E='bruteforce'
F='flood'
predict = knn.predict(X_test)
for i in range(len(predict)):
    # print the features of each test sample together with its predicted class name
    if predict[i] == 0:
        print("{} :***********************************{}".format(X_test.iloc[i, :], A))
    elif predict[i] == 1:
        print("{} :***********************************{}".format(X_test.iloc[i, :], B))
    elif predict[i] == 2:
        print("{} :***********************************{}".format(X_test.iloc[i, :], C))
    elif predict[i] == 3:
        print("{} :***********************************{}".format(X_test.iloc[i, :], D))
    elif predict[i] == 4:
        print("{} :***********************************{}".format(X_test.iloc[i, :], E))
    else:
        print("{} :***********************************{}".format(X_test.iloc[i, :], F))
CHAPTER 10
This project is for building and evaluating machine learning models for MQTT data analytics; MQTT is a lightweight messaging protocol often used in IoT applications.
This loads MQTT data, preprocesses it, builds and evaluates two machine learning
models (Random Forest and KNN) for classification, and provides insights into the
model performance through visualizations and predictions on test data.
3. Data Preprocessing
4. Data Splitting: Split the data into training and testing sets using scikit-learn's
train_test_split function. The training set contains 80% of the data, while the testing
set contains 20%.
5. KNN Classifier
• Check if a pre-trained KNN classifier model exists in a file named 'model.pkl'.
If it exists, load the model; otherwise, train a new KNN classifier on the
training data and save it to the file.
• Use the trained KNN model to make predictions on the test data.
• Calculate and print the accuracy of the KNN model.
• Print a classification report with detailed evaluation metrics.
• Create and display a confusion matrix heatmap.
7. Data Visualization: Create a count plot of the 'target' variable to visualize the
distribution of different MQTT message types.
tcp.flags: This refers to the flags used in TCP packets. TCP flags are control bits
within the TCP header that indicate various conditions and control actions in the TCP
connection.
mqtt.kalive: This represents the keep-alive interval for the MQTT connection.
mqtt.proto_len, mqtt.protoname: These fields give the length and name of the MQTT protocol.
target: This field is the target or label associated with each data entry, which is used
for classification purposes.
Target Variables
Legitimate: This category refers to network traffic that is considered normal and
acceptable. It includes genuine connections, messages, or activities that are in line
with the expected behavior of the network. Legitimate traffic is typically free from
malicious intent or anomalies.
Slowite: This is a type of network behavior that causes a slowdown or degradation in the performance of the network. It can occur due to various factors such as inefficient communication, bottlenecks, or other issues.
Brute Force: Brute force attacks involve systematically trying all possible
combinations of passwords or keys until the correct one is found. In the context of
network security, a brute force attack might be an attempt to gain unauthorized access
to a system by repeatedly trying different credentials.
Flood: A "flood" typically refers to a large and rapid influx of data or requests. In the
context of network traffic, a flood attack involves overwhelming a system or network
with a massive volume of traffic. This can lead to congestion and potentially disrupt
the normal functioning of the targeted network.
The count plot displays the distribution of the different target categories or message types in the MQTT dataset. Each bar in the plot corresponds to a specific message type, and the height of the bar represents the count or frequency of that message type in the dataset. This visualization helps us understand the class distribution, which can be crucial for assessing potential class imbalances.
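A sketch of how such a count plot can be produced with seaborn, consistent with the plotting code in Chapter 9 (the dataset DataFrame is assumed to hold the loaded MQTT data):

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.countplot(x='target', data=dataset)   # one bar per MQTT message type
plt.xlabel("Categories")
plt.ylabel("Count")
plt.title("Count Plot")
plt.show()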
Figure 10.1: Sample dataset used for MQTT data analytics.
Figure 10.2: Classification report and accuracy obtained using RF classification.
Figure 10.3: Illustration of the confusion matrix obtained using RF classification on MQTT data.
Figure 10.4: Accuracy and classification report obtained using KNN classification.
Figure 10.5: Confusion matrix obtained using KNN classification.
CHAPTER 11
CONCLUSION AND FUTURE SCOPE
In the context of MQTT data analytics, this project has successfully accomplished its
primary objective of developing and evaluating machine learning models for
classifying MQTT messages into distinct categories or message types. The Random
Forest Classifier and KNN Classifier have been effectively implemented and assessed
for their classification capabilities. Through accuracy measurements and the
generation of comprehensive classification reports, both models have demonstrated
their competence in categorizing MQTT data with accuracy and precision. One
notable feature of the project is the visualization of model performance using
confusion matrices. These matrices provide a clear representation of the models'
effectiveness by depicting true positives, true negatives, false positives, and false
negatives. Such visual insights are invaluable for understanding the models' strengths
and weaknesses.
To ensure data readiness for machine learning, the project engaged in critical data
preprocessing tasks, including data cleansing and label encoding. This preprocessing
step plays a pivotal role in ensuring that the data is well-suited for analysis and
classification. Moreover, the project encompasses mechanisms for persisting trained
models. By saving and loading pre-trained models, the efficiency of handling new
MQTT data is greatly improved. This obviates the need for retraining models from
scratch, streamlining the process of classifying incoming data. The project also
devoted attention to data exploration through count plots, which provide a visual
representation of the distribution of MQTT message types. This exploration step aids
in comprehending the dataset's class distribution, which can be crucial for subsequent
decision-making and analysis.
Future Scope
There are several avenues for enhancing this MQTT data analytics project. Firstly, optimizing the models' hyperparameters can be considered to potentially boost classification performance. Techniques like grid search or random search can be employed to systematically identify the most effective hyperparameter configurations.
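For instance, a grid search over a few Random Forest hyperparameters could be sketched as follows; the parameter values shown are illustrative assumptions only, and X_train and y_train are assumed to come from the existing train/test split:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)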
Additionally, model ensemble strategies, such as stacking or bagging, can be
investigated to combine the strengths of multiple models, potentially leading to
improved classification accuracy. Further, anomaly detection techniques could be
integrated to identify unusual or unexpected MQTT messages, which can be
indicative of security breaches or system irregularities.
CHAPTER 12
REFERENCES
[1] Al-Masri, E.; Kalyanam, K.R.; Batts, J.; Kim, J.; Singh, S.; Vo, T.; Yan, C. Investigating
messaging protocols for the Internet of Things (IoT). IEEE Access 2020, 8, 94880–94911.
[2] Kodali, R.K.; Soratkal, S. MQTT Based Home Automation System Using ESP8266. In
Proceedings of the 2016 IEEE Region 10 Humanitarian Technology Conference (R10-HTC),
Agra, India, 21–23 December 2016; pp. 1–5.
[3] Cornel-Cristian, A.; Gabriel, T.; Arhip-Calin, M.; Zamfirescu, A. Smart Home Automation with
MQTT. In Proceedings of the 2019 54th International Universities Power Engineering
Conference (UPEC), Bucharest, Romania, 3–6 September 2019; pp. 1–5.
[4] Prabaharan, J.; Swamy, A.; Sharma, A.; Bharath, K.N.; Mundra, P.R.; Mohammed, K.J.
Wireless Home Automation and Security System Using MQTT Protocol. In Proceedings of the
2017 2nd IEEE International Conference on Recent Trends in Electronics, Information &
Communication Technology (RTEICT), Bangalore, India, 19–20 May 2017; pp. 2043–2045.
[5] Kodali, R.K.; Sarjerao, B.S. A Low Cost Smart Irrigation System Using MQTT Protocol. In
Proceedings of the 2017 IEEE Region 10 Symposium (TENSYMP), Cochin, India, 14–16 July
2017; pp. 1–5.
[6] Mukherji, S.V.; Sinha, R.; Basak, S.; Kar, S.P. Smart Agriculture Using Internet of Things and
mqtt Protocol. In Proceedings of the 2019 International Conference on Machine Learning, Big
Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 14–16 February 2019; pp.
14–16.
[7] Rayes, A.; Salam, S. Internet of Things from Hype to Reality—The Road to Digitization, 2nd
ed.; Springer: Cham, Switzerland, 2019.
[8] Anam, M.; Ponnusamy, V.; Hussain, M.; Nadeem, M.W.; Javed, M.; Goh, H.G.; Qadeer, S.
Osteoporosis Prediction for Trabecular Bone Using Machine Learning: A Review. Comput.
Mater. Contin. 2021, 67, 89–105.
[9] Polat, H.; Polat, O.; Cetin, A. Detecting DDoS Attacks in Software-Defined Networks Through
Feature Selection Methods and Machine Learning Models. Sustainability 2020, 12, 1035.
[10] Ochôa, I.S.; Leithardt, V.R.Q.; Calbusch, L.; Santana, J.F.D.P.; Parreira, W.D.; Seman, L.O.;
Zeferino, C.A. Performance and Security Evaluation on a Blockchain Architecture for License
Plate Recognition Systems. Appl. Sci. 2021, 11, 1255.
[11] Anjos, J.C.S.D.; Gross, J.L.G.; Matteussi, K.J.; González, G.V.; Leithardt, V.R.Q.; Geyer,
C.F.R. An Algorithm to Minimize Energy Consumption and Elapsed Time for IoT Workloads
in a Hybrid Architecture. Sensors 2021, 21, 2914.
[12] Ganguly, S.; Garofalakis, M.; Rastogi, R.; Sabnani, K. Streaming Algorithms for Robust, Real-
Time Detection of ddos Attacks. In Proceedings of the 27th International Conference on
Distributed Computing Systems (ICDCS’07), Toronto, ON, Canada, 25–27 June 2007; p. 4.
[13] Soni, D.; Makwana, A. A Survey on mqtt: A Protocol of Internet of Things (Iot). In
Proceedings of the International Conference on Telecommunication, Power Analysis and
Computing Techniques (ICTPACT-2017), Chennai, India, 6–8 April 2017; Volume 20.
[14] Hunkeler, U.; Truong, H.L.; Stanford-Clark, A. MQTT-S—A Publish/Subscribe Protocol for
Wireless Sensor Networks. In Proceedings of the 2008 3rd International Conference on
Communication Systems Software and Middleware and Workshops (COMSWARE’08),
Bangalore, India, 6–10 January 2008; pp. 791–798.
[15] Ahmadon, M.A.B.; Yamaguchi, N.; Yamaguchi, S. Process-Based Intrusion Detection Method
for IoT System with MQTT Protocol. In Proceedings of the 2019 IEEE 8th Global Conference
on Consumer Electronics (GCCE), Osaka, Japan, 15–18 October 2019; pp. 953–956.
[16] Jan, S.U.; Lee, Y.D.; Koo, I.S. A distributed sensor-fault detection and diagnosis framework
using machine learning. Inf. Sci. 2021, 547, 777–796.
[17] Alaiz-Moreton, H.; Aveleira-Mata, J.; Ondicol-Garcia, J.; Muñoz-Castañeda, A.L.; García, I.;
Benavides, C. Multiclass classification procedure for detecting attacks on MQTT-IoT protocol.
Complexity 2019, 2019, 6516253.
[18] Hindy, H.; Bayne, E.; Bures, M.; Atkinson, R.; Tachtatzis, C.; Bellekens, X. Machine Learning
Based IoT Intrusion Detection System: An MQTT Case Study (MQTT-IoT-IDS2020 Dataset).
In Proceedings of the International Networking Conference, Online, 19–21 September 2020;
Springer: Cham, Switzerland, 2020; pp. 73–84.
[19] Ullah, I.; Ullah, A.; Sajjad, M. Towards a Hybrid Deep Learning Model for Anomalous
Activities Detection in Internet of Things Networks. IoT 2021, 2, 428–448.
[20] Almaiah, M.A.; Almomani, O.; Alsaaidah, A.; Al-Otaibi, S.; Bani-Hani, N.; Hwaitat, A.K.A.;
Al-Zahrani, A.; Lutfi, A.; Awad, A.B.; Aldhyani, T.H.H. Performance Investigation of Principal
Component Analysis for Intrusion Detection System Using Different Support Vector Machine
Kernels. Electronics 2022, 11, 3571.
[21] Shalaginov, A.; Semeniuta, O.; Alazab, M. MEML: Resource-Aware MQTT-Based Machine
Learning for Network Attacks Detection on IoT Edge Devices. In Proceedings of the 12th
IEEE/ACM International Conference on Utility and Cloud Computing Companion, Auckland,
New Zealand, 2–5 December 2019; pp. 123–128.
[22] Ujjan, R.M.A.; Pervez, Z.; Dahal, K.; Khan, W.A.; Khattak, A.M.; Hayat, B. Entropy Based
Features Distribution for Anti-DDoS Model in SDN. Sustainability 2021, 13, 1522.
[23] Gadze, J.D.; Bamfo-Asante, A.A.; Agyemang, J.O.; Nunoo-Mensah, H.; Opare, K.A.-B. An
Investigation into the Application of Deep Learning in the Detection and Mitigation of DDOS
Attack on SDN Controllers. Technologies 2021, 9, 14.
[24] Ahuja, N.; Singal, G.; Mukhopadhyay, D.; Kumar, N. Automated DDOS attack detection in
software defined networking. J. Netw. Comput. Appl. 2021, 187, 103108.
[25] Wang, Z.; Zeng, Y.; Liu, Y.; Li, D. Deep belief network integrating improved kernel-based
extreme learning machine for network intrusion detection. IEEE Access 2021, 9, 16062–16091.