
A

Mini Project Report


on
MACHINE LEARNING MODEL FOR MESSAGE QUEUING
TELEMETRY TRANSPORT DATA ANALYTICS
Submitted for partial fulfilment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
Submitted by

S VARUN KUMAR 20K81A12H0


GANGAVENNOLA SRUJAN 20K81A12E3
MANDAJI ADHARSH 21K85A1218
ADIKAM NIKHIL GOUD 20K81A12C1

Under the Guidance of


Mrs. T BHARGAVI
ASSISTANT PROFESSOR

DEPARTMENT OF INFORMATION TECHNOLOGY

St. MARTIN'S ENGINEERING COLLEGE


UGC Autonomous
Affiliated to JNTUH, Approved by AICTE,
Accredited by NBA & NAAC A+, ISO 9001-2008 Certified
Dhulapally, Secunderabad - 500 100
www.smec.ac.in

OCTOBER – 2023

St. MARTIN'S ENGINEERING COLLEGE
UGC Autonomous
Affiliated to JNTUH, Approved by AICTE
NBA & NAAC A+ Accredited
Dhulapally, Secunderabad - 500 100
www.smec.ac.in

Certificate

This is to certify that the project entitled “MACHINE LEARNING MODEL
FOR MESSAGE QUEUING TELEMETRY TRANSPORT DATA
ANALYTICS” is being submitted by S. Varun Kumar (20K81A12H0), G. Srujan
(20K81A12E3), M. Adharsh (21K85A1218), and A. Nikhil Goud (20K81A12C1) in
partial fulfilment of the requirements for the award of the degree of BACHELOR OF
TECHNOLOGY IN INFORMATION TECHNOLOGY, and is a record of bonafide
work carried out by them. The results embodied in this report have been verified and
found satisfactory.

Signature of Guide Signature of HOD


Mrs. T. BHARGAVI Dr. V K SENTHIL RAGAVAN
Assistant Professor Professor and Head of Department
Department of Information Technology Department of Information Technology

Internal Examiner External Examiner

Place:

Date:

St. MARTIN'S ENGINEERING COLLEGE
UGC Autonomous
Affiliated to JNTUH, Approved by AICTE
NBA & NAAC A+ Accredited
Dhulapally, Secunderabad - 500 100
www.smec.ac.in

DEPARTMENT OF INFORMATION TECHNOLOGY

Declaration
We, the students of ‘Bachelor of Technology in Department of Information
Technology’, session: 2020 - 2024, St. Martin’s Engineering College, Dhulapally,
Kompally, Secunderabad, hereby declare that the work presented in this Project
Work entitled “MACHINE LEARNING MODEL FOR MESSAGE QUEUING
TELEMETRY TRANSPORT DATA ANALYTICS” is the outcome of our own
bonafide work, is correct to the best of our knowledge, and has been undertaken
with due regard for Engineering Ethics. The results embodied in this project
report have not been submitted to any other university for the award of any degree.

S VARUN KUMAR 20K81A12H0


GANGAVENNOLA SRUJAN 20K81A12E3
MANDAJI ADHARSH 21K85A1218
ADIKAM NIKHIL GOUD 20K81A12C1

ACKNOWLEDGEMENT

The satisfaction and euphoria that accompany the successful completion of
any task would be incomplete without the mention of the people who made it possible
and whose encouragement and guidance have crowned our efforts with success.
First and foremost, we would like to express our deep sense of gratitude and
indebtedness to our College Management for their kind support and permission to use
the facilities available in the Institute.
We especially would like to express our deep sense of gratitude and
indebtedness to Dr. P. SANTOSH KUMAR PATRA, Group Director, St. Martin’s
Engineering College Dhulapally, for permitting us to undertake this project.
We wish to record our profound gratitude to Dr. M. SREENIVAS RAO,
Principal, St. Martin’s Engineering College, for his motivation and encouragement.
We are also thankful to Dr. V K. SENTHIL RAGAVAN, Head of the
Department, Information Technology, St. Martin’s Engineering College, Dhulapally,
Secunderabad, for his support and guidance throughout our project, as well as to
Project Coordinator Mr. G. SATHISH, Assistant Professor, Information Technology
Department, for his valuable support.
We would like to express our sincere gratitude and indebtedness to our project
supervisor Mrs. T. BHARGAVI, Assistant Professor, Information Technology, St.
Martins Engineering College, Dhulapally, for her support and guidance throughout
our project.
Finally, we express our thanks to all those who have helped us in successfully
completing this project. Furthermore, we would like to thank our family and friends
for their moral support and encouragement.

S VARUN KUMAR 20K81A12H0


GANGAVENNOLA SRUJAN 20K81A12E3
MANDAJI ADHARSH 21K85A1218
ADIKAM NIKHIL GOUD 20K81A12C1

CONTENTS

ACKNOWLEDGMENT i

LIST OF FIGURES iv

LIST OF TABLES v

LIST OF ACRONYMS AND DEFINITIONS vi

ABSTRACT vii

CHAPTER 1: INTRODUCTION 1

CHAPTER 2: LITERATURE SURVEY 3

CHAPTER 3: EXISTING TECHNIQUE 6

3.1 KNN Classifier 6

3.2 Limitations 9

CHAPTER 4: PROPOSED METHODOLOGY 10

4.1 Overview 10

4.2 Data Preprocessing 12

4.3 Splitting the Dataset 14

4.4 Random Forest Algorithm 15

CHAPTER 5: UML DIAGRAMS 20

CHAPTER 6: SOFTWARE ENVIRONMENT 25

CHAPTER 7: SYSTEM REQUIREMENTS 38

7.1 Software Requirements 38

7.2 Hardware Requirements 38

CHAPTER 8: FUNCTIONAL REQUIREMENTS 39

CHAPTER 9: SOURCE CODE 44

CHAPTER 10: RESULTS AND DISCUSSION 51

10.1 Implementation Description 51

10.2 Dataset Description 52

10.3 Results and Discussion 54

CHAPTER 11: CONCLUSION AND FUTURE SCOPE 59

CHAPTER 12: REFERENCES 61

LIST OF FIGURES

Figure No. Figure Title Page No.

1.1 Topology of the MQTT Protocol 1

1.2 Connecting MQTT with IoT Applications 2

3.1 KNN Dataset 6

3.2 Considering new data Point 7

3.3 Measuring of Euclidean Distance 8

3.4 Assigning data Point to Category A 8

4.1 Overall Design of Proposed Methodology 11

4.2 Feature Scaling 13

4.3 Splitting the Dataset 14

4.4 Random Forest Algorithm 15

4.5 RF Classifier Analysis 17

4.6 Boosting RF Classifier 18

5.1 Class Diagram 21

5.2 Data Flow Diagram 22

5.3 Sequence Diagram 23

5.4 Activity Diagram 24

10.1 Count Plot of Target Categories 56

10.2 Confusion Matrix of RF 57

10.3 Action Recognition 58

LIST OF TABLES

Table No. Table Name Page No.

10.1 Sample dataset used for data analytics 56

10.2 Accuracy obtained using RF 57

10.3 Accuracy obtained using KNN 58

LIST OF ACRONYMS AND DEFINITIONS

S.No: Acronym Definition

01. KNN K-Nearest Neighbors

02. RF Random Forest

03. DDOS Distributed Denial of Service

04. ML Machine Learning

05. DL Deep Learning

06. LSTM Long Short-Term Memory Network

07. CNN Convolutional Neural Network

08. SDN Software Defined Network

09. DBN Deep Belief Network

10. DFD Data Flow Diagram

11. UML Unified Modelling Language

ABSTRACT

Message Queuing Telemetry Transport (MQTT) has become a dominant protocol
for the Internet of Things (IoT) because of its efficient use of bandwidth and low
power consumption. It enables seamless communication between devices with limited
resources, making it a perfect fit for various IoT applications like smart homes,
industrial automation, and healthcare monitoring. This lightweight messaging
protocol is widely adopted in the IoT ecosystem to facilitate smooth communication
between devices and applications. As the number of IoT devices continues to grow
rapidly, so does the volume of data they generate. This surge in data highlights the
need for robust analytics solutions capable of extracting valuable insights from the
MQTT data stream. Traditional analytics methods struggle to handle real-time data
processing and uncover actionable insights, which underscores the necessity for
specialized machine learning solutions. The proposed machine learning model is of
great significance as it has the potential to unleash the power of MQTT data and
transform it into actionable insights. By harnessing advanced algorithms and adaptive
learning techniques, this model can identify patterns, anomalies, and trends in real-
time MQTT data streams. This empowers businesses and organizations to make
informed decisions and optimize their IoT operations effectively. With the ability to
handle diverse data sources, the model becomes versatile and applicable across
various domains, such as healthcare, smart cities, agriculture, and manufacturing.

CHAPTER 1
INTRODUCTION
1.1 Overview

When the Internet of Things (IoT) is implemented, physical devices (also known as
IoT nodes) are connected to the internet, enabling them to collect and exchange data
with other nodes in the network without the need for human participation [1].
Message queuing telemetry transport (MQTT) has gained widespread use in a range of
applications, such as in smart homes [2–4], agricultural IoT [5, 6], and industrial
applications. This is mainly due to its capacity to communicate at low bandwidths, the
necessity for minimum memory, and reduced packet loss. Figure 1.1 depicts the
topology of the MQTT protocol for use in the IoT.

Figure 1.1: Topology of the MQTT protocol.

The IoT and associated technologies have evolved at a rapid rate, with 15 billion linked
devices in 2015, a figure likely to increase to 38 billion devices by 2025, according to
Gartner [7]. The IoT is a network of objects—linked by sensors, actuators, gateways,
and cloud services—that delivers a service to users. The MQTT protocol was
integrated into a number of IoT applications. Figure 1.2 depicts how the MQTT
protocol supports IoT applications. Traditional intrusion detection systems (IDSs)
are only successful when dealing with data that move slowly or with small volumes of
data. They are currently inefficient when dealing with big data or networks and are
unable to cope with high-speed data transmission. Therefore, technologies capable of
dealing with massive volumes of data and identifying any indications of network
penetration are crucial. When it comes to big data, data security and privacy are
perhaps the most pressing concerns, especially in the context of network assaults.
Distributed denial-of-service (DDoS) attacks are one of the most common types of
cyberattacks. They target servers or networks with the intent of interfering with their
normal operation. Although real-time detection and mitigation of DDoS attacks is
difficult to achieve, a solution would be extremely valuable, since attacks can cause
significant damage.

Figure 1.2: Connecting MQTT with IoT applications.

CHAPTER 2
LITERATURE SURVEY

Machine learning (ML) studies are always being improved through the use of
training data and the exploitation of available information. Some consider ML to be a
component of artificial intelligence. Depending on the information provided, various
types of learning can be undertaken, including supervised learning—for example, the
support vector machine (SVM) algorithm and the k-nearest neighbors (KNN)
algorithm—semi-supervised learning, and unsupervised learning (e.g., clustering
methods). Deep learning (DL) models combined with ML techniques produce
excellent results in cybersecurity systems used for detecting attacks. ML techniques
are used in multiple contexts, such as in healthcare. For example, they are being used
to forecast COVID-19 outbreaks, osteoporosis, and schistosomiasis, among other
health-related problems [8, 9]. Many researchers employed classification algorithms
to detect and resolve DDoS attacks with the goal of reducing the number of attacks.
DDoS attacks are simple to carry out because they take advantage of network flaws
and generate requests for software services [10, 11]. DDoS attacks take a long time to
identify and neutralize, so an effective detection solution would be particularly valuable, since these attacks may
cause major harm. There are significant drawbacks to the current methods used to
detect DDoS attacks, such as high processing costs and the inability to handle
enormous quantities of data reaching the server [12]. Using a variety of classification
methods, classification algorithms differentiate DDoS packets from other kinds of
packets [13–15]. To secure the IoT against anomalous adversarial attacks, various
security-enhancing solutions were developed. These approaches are often used to
detect attacks in IoT networks by monitoring IoT node operations, such as the rate at
which data are sent. In this paper, we introduce a brief review of the literature to
highlight recent advancements in IoT security systems, with a particular emphasis on
IDSs that target the MQTT protocol. The authors of [16] provided a process tree-
based intrusion detection technique for MQTT protocols based on their previous
work. It describes network behavior in terms of the hierarchical branches of a tree,
which can then be used to detect assaults or aberrant behavior in the network. The
detection rate was used to evaluate the model, and four frequent types of assaults were

introduced into the network to assess its performance. However, little consideration
was given to newly created adversarial attacks and intrusions.

The study [17] proposes a fuzzy logic model for intrusion detection that is specifically
built to safeguard IoT nodes that use the MQTT protocol from denial-of-service
(DoS) attacks. Although fuzzy logic demonstrated its efficacy in a variety of systems,
including sensor device intrusion detection in the IoT [18], its rapidly growing complexity
with increasing input dimensions limits its ability to detect attacks on IoT platforms where
large amounts of data are transferred on a continuous basis. The extreme gradient
boosting (XGBoost) algorithm, gated recurrent units (GRUs), and LSTM are only a
few of the ML methods used in [19, 20] to create a cybersecurity system for the
MQTT protocol in the IoT. The MQTT dataset, which contains several forms of attacks,
including intrusion (illegal entrance), DoS, malicious code injection, and man-in-
the-middle (MitM) attacks, was used to verify the proposed techniques. To test a range
of ML approaches, the MQTT-IoTIDS2020 dataset was used. Using these ML
approaches, it was found that a system for detecting MQTT attacks could be designed,
and this was later validated by the researchers. An MQTT-enabled IoT cybersecurity
system demonstrated the use of an ANN approach for intrusion detection [21].

Ujjan et al. [22] presented an entropy-based feature selection method to identify the
important features in network traffic for detecting DoS attacks, employing an
autoencoder (SAE) and CNN models. CPU consumption was significantly higher and
detection took significantly longer. The models achieved accuracies of 94% and 93%,
respectively. Using LSTM and CNN, Gadze et al. [23] introduced DL models to
identify DDoS intrusions on a software-defined network’s centralized controller.
The accuracy of the models was lower than expected. When data were split
respectively. However, when using an LSTM model to detect intrusions in network
traffic, DDoS was found to be the most time-consuming attempt out of all 10 attempts
tested. A hybrid ML model, SVM combined with random forest (SVC-RF), was
created by Ahuja et al. [24] and used to distinguish between benign and malicious
traffic. The authors extracted features from the original dataset that were used to build
a new dataset: the SDN dataset, which had innovative features. It was determined that
the SVC-RF classifier is capable of accurately categorizing data traffic with an
accuracy of 98.8% when using the software defined networking (SDN) dataset.

Wang et al. [25] revealed that a unique DL model based on an upgraded deep belief
network (DBN) can be used to identify network intrusions more quickly. They
replaced the back propagation approach in DBN with a kernel-based extreme learning
machine (KELM), which was created by the researchers and is still in development.
Their model outperformed other current neural network approaches by a wide margin.
In this study, the researchers examined and tested the accuracy of a number of
different categorization algorithms and techniques. The results reveal that the DBN-
KELM algorithm obtained an accuracy of 93.5%, while the DBN-EGWO-KELM
method achieved an accuracy of 98.60%.

CHAPTER 3

EXISTING TECHNIQUE

3.1 KNN Classifier

K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms,
based on the supervised learning technique. The K-NN algorithm assumes similarity
between the new case/data and the available cases and puts the new case into the
category that is most similar to the available categories. It stores all the available data
and classifies a new data point based on this similarity. This means that when new
data appears, it can be easily classified into a well-suited category by using the K-NN
algorithm. It can be used for regression as well as classification, but mostly it is used
for classification problems. It is a non-parametric algorithm, which means it makes no
assumption about the underlying data. It is also called a lazy learner algorithm because
it does not learn from the training set immediately; instead, it stores the dataset and, at
the time of classification, performs an action on it. The K-NN algorithm at the training
phase just stores the dataset, and when it gets new data, it classifies that data into a
category that is most similar to the new data.

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a
new data point x1. In which of these categories will this data point lie? To solve this
type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily
identify the category or class of a particular data point. Consider the below diagram:

Fig. 3.1: KNN on dataset.
How does K-NN work?

The K-NN working can be explained based on the below algorithm:

Step-1: Select the number K of the neighbors.

Step-2: Calculate the Euclidean distance of K number of neighbors.

Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

Step-4: Among these k neighbors, count the number of the data points in each
category.

Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.

Step-6: Model is ready.

Suppose we have a new data point, and we need to put it in the required category.
Consider the below image:

Fig. 3.2: Considering new data point.

Firstly, we will choose the number of neighbors, so we will choose k = 5.

Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. For two points (x1, y1) and (x2, y2) it can be calculated as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Fig. 3.3: Measuring of Euclidean distance.

By calculating the Euclidean distance, we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:

Fig. 3.4: Assigning data point to category A.

As we can see, 3 of the 5 nearest neighbors are from category A; hence this new data
point must belong to category A.

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN
algorithm:
• There is no particular way to determine the best value for "K", so we need to
try some values to find the best out of them. The most preferred value for K is
5.
• A very low value for K, such as K=1 or K=2, can be noisy and expose the
model to the effects of outliers.
• Large values for K smooth out predictions, but the model may then have
difficulty capturing local patterns in the data.

3.2 Limitations

• Computational Complexity: Making predictions with KNN involves
calculating distances between the new data point and all training data points.
This can be computationally expensive, especially for large datasets.
• Sensitivity to Noise: This model is sensitive to noisy data and outliers. Noisy
data can significantly impact the classification results.
• Need for Optimal K: Selecting the appropriate value of K (the number of
nearest neighbors) is crucial. Choosing an inappropriate K value can lead to
suboptimal results.
• Imbalanced Data: It struggles with imbalanced datasets where one class
significantly outnumbers the others. The majority class can dominate
predictions.
• Curse of Dimensionality: In high-dimensional feature spaces, the notion of
distance becomes less meaningful, and KNN suffers from the curse of
dimensionality, leading to reduced performance.

CHAPTER 4

PROPOSED METHODOLOGY

4.1 Overview
This research presents a project for building a machine learning model for MQTT
data analytics; MQTT is a lightweight messaging protocol commonly used in IoT and
other applications. The main goal of this work is to develop machine learning models
that can classify MQTT data into different categories or message types. These
categories might represent different types of MQTT messages or events, such as
legitimate messages, denial-of-service (DoS) attacks, malformed messages, brute
force attacks, and floods. Here's an overview of the main components:

Step 1: Data Preparation

• Load MQTT Data: The project begins by loading MQTT data from a CSV file
named "mqttdataset_reduced.csv" into a Pandas DataFrame.
• Data Cleaning: Replace a specific value (0x00000000) with 0 in the dataset.
• Data Encoding: Encode categorical variables using Label Encoding. This is
necessary for machine learning algorithms, which typically require numerical
input.
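A minimal sketch of Step 1, assuming scikit-learn's LabelEncoder is used for the encoding; since the report does not list the individual column names, the loop below simply encodes every non-numeric column:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('mqttdataset_reduced.csv')    # load the MQTT data
df = df.replace('0x00000000', 0)               # data cleaning

encoders = {}
for col in df.select_dtypes(include='object').columns:
    encoders[col] = LabelEncoder()             # one encoder per categorical column
    df[col] = encoders[col].fit_transform(df[col].astype(str))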

Step 2: Data Exploration

• Visualize Data: Create a count plot to visualize the distribution of different
MQTT message types in the dataset. This helps in understanding the data's
class distribution.
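A short sketch of this step, assuming seaborn is available and that the message-type label column is named 'target' (the actual column name is an assumption):

import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='target', data=df)              # one bar per MQTT message type
plt.title('Distribution of MQTT message types')
plt.xticks(rotation=45)
plt.show()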

Step 3: Machine Learning Models

• Random Forest Classifier: Train a Random Forest classifier on the
preprocessed data. If a pre-trained model exists in a file
('randomforestmodel.pkl'), it is loaded; otherwise, a new model is trained. The
Random Forest model is used for classification tasks.
• K-Nearest Neighbors (KNN) Classifier: Train a KNN classifier on the same
data. Similar to the Random Forest model, it checks for the existence of a pre-
trained model in a file ('model.pkl'). If it doesn't exist, a new model is trained.
KNN is another classification algorithm used for pattern recognition.
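A sketch of the load-or-train logic described in both bullets; the pickle file names come from the report, while x_train and y_train are assumed to come from the dataset split described later:

import os
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def load_or_train(path, model, x_train, y_train):
    # Load a pickled model if the file exists; otherwise train and save one.
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    model.fit(x_train, y_train)
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    return model

rf = load_or_train('randomforestmodel.pkl', RandomForestClassifier(), x_train, y_train)
knn = load_or_train('model.pkl', KNeighborsClassifier(), x_train, y_train)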

Step 4: Model Evaluation

• Accuracy Calculation: Calculate and print the accuracy of both the Random
Forest and KNN classifiers on the test data.
• Classification Report: Generate a detailed classification report for both
models, including metrics such as precision, recall, F1-score, and support for
each class.
• Confusion Matrix: Create confusion matrix heatmaps to visualize the model's
performance in terms of true positive, true negative, false positive, and false
negative predictions.
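A sketch of the evaluation step for one classifier (the same calls apply to the KNN model); x_test and y_test are assumed to come from the train/test split:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = rf.predict(x_test)
print('Accuracy:', accuracy_score(y_test, y_pred))   # proportion classified correctly
print(classification_report(y_test, y_pred))         # precision, recall, F1, support

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()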

Step 5: Prediction on Test Data

• After training and evaluating the models, the code performs predictions on the
test data for both the Random Forest and KNN classifiers.
• For each prediction, it determines the predicted class label (e.g., 'legitimate',
'dos', 'slowite', etc.) and prints the features of the test data point along with the
predicted class label.
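A sketch of this prediction loop, assuming the labels were encoded with the LabelEncoder fitted during preprocessing and that the label column is named 'target' (both assumptions); x_test is assumed to be a NumPy array of feature rows:

le = encoders['target']                        # encoder fitted on the label column
for features, code in zip(x_test, rf.predict(x_test)):
    label = le.inverse_transform([code])[0]    # e.g. 'legitimate', 'dos', 'slowite'
    print(features, '->', label)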

In summary, this project aims to automate the classification of MQTT data into
different message types using machine learning. It includes data preprocessing, model
training and evaluation, and visualizations to assess and understand the models'
performance. Additionally, it provides a way to save and load pre-trained models to
avoid retraining when working with new data.

Figure 4.1: Overall design of proposed methodology.

4.2 Data Preprocessing

Data pre-processing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and most crucial step when creating a machine
learning model. When creating a machine learning project, it is not always the case
that we come across clean and formatted data, and before doing any operation with
data, it is necessary to clean it and put it in a formatted way; for this we use the data
pre-processing task. Real-world data generally contains noise and missing values, and
may be in an unusable format that cannot be directly used by machine learning
models. Data pre-processing cleans the data and makes it suitable for a machine
learning model, which also increases the accuracy and efficiency of the model. The
main steps are:

• Getting the dataset
• Importing libraries
• Importing datasets
• Splitting the dataset into training and test sets

Importing Libraries: To perform data preprocessing using Python, we need to
import some predefined Python libraries. These libraries are used to perform some
specific jobs. There are three specific libraries that we will use for data preprocessing,
which are:

Numpy: The Numpy Python library is used for including any type of mathematical
operation in the code. It is the fundamental package for scientific calculation in
Python. It also supports large, multidimensional arrays and matrices. So, in
Python, we can import it as:

import numpy as nm

Here we have used nm, which is a short name for Numpy, and it will be used in the
whole program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library.
With this library, we need to import the sub-library pyplot, which is used to plot any
type of chart in Python. It is imported as below:

import matplotlib.pyplot as mpt

Here we have used mpt as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous
Python libraries and is used for importing and managing datasets. It is an open-
source data manipulation and analysis library. It is imported as below:

import pandas as pd

Here, we have used pd as a short name for this library.

Feature Scaling: Feature scaling is the final step of data preprocessing in machine
learning. It is a technique to standardize the independent variables of the dataset into a
specific range. In feature scaling, we put our variables in the same range and on the
same scale so that no variable dominates any other. Many machine learning models
are based on Euclidean distance, and if we do not scale the variables, this will cause
issues in our machine learning model. The Euclidean distance between two points
(x1, y1) and (x2, y2) is given as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Figure 4.2: Feature scaling.

If we compare any two values of age and salary, the salary values will dominate
the age values and produce an incorrect result. So, to remove this issue, we
need to perform feature scaling for machine learning.
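A minimal standardization sketch, assuming scikit-learn's StandardScaler and feature matrices x_train and x_test from the split described in the next section:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)   # learn mean/std on training data only
x_test = scaler.transform(x_test)         # reuse the same scaling parameters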

4.3 Splitting the Dataset

In machine learning data preprocessing, we divide our dataset into a training set and a
test set. This is one of the crucial steps of data preprocessing, as it enhances the
performance of our machine learning model. Suppose we train our machine learning
model on one dataset and then test it on a completely different dataset; the model will
have difficulty understanding the correlations between them. If we train our model
very well and its training accuracy is very high, but we then provide a new dataset to
it, its performance will decrease. So we always try to build a machine learning model
that performs well with the training set and also with the test dataset. Here, we can
define these datasets as:

Figure 4.3: Splitting the dataset.

Training Set: A subset of dataset to train the machine learning model, and we already
know the output.

Test set: A subset of dataset to test the machine learning model, and by using the test
set, model predicts the output.

For splitting the dataset, we will use the below lines of code:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Explanation

• In the above code, the first line is used for splitting arrays of the dataset into
random train and test subsets.

• In the second line, we have used four variables for our output, which are:

• x_train: features for the training data

• x_test: features for testing data

• y_train: Dependent variables for training data

• y_test: Dependent variables for testing data

• In the train_test_split() function, we have passed four parameters, of which the
first two are the arrays of data, and test_size specifies the size of the test set.
The test_size may be 0.5, 0.3, or 0.2, which gives the dividing ratio between
the training and testing sets.

• The last parameter, random_state, sets a seed for the random generator so
that you always get the same result; the most commonly used value for it is 42.

4.4 Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both classification and regression problems in
ML. It is based on the concept of ensemble learning, the process of combining
multiple classifiers to solve a complex problem and improve the performance of the
model. As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to improve
the predictive accuracy of that dataset." Instead of relying on one decision tree, the
random forest takes the prediction from each tree and, based on the majority vote of
predictions, outputs the final prediction. A greater number of trees in the forest
leads to higher accuracy and prevents the problem of overfitting.

Fig. 4.4: Random Forest algorithm.

Random Forest algorithm

Step 1: In Random Forest, n random records are taken from a data set having k
records.

Step 2: An individual decision tree is constructed for each sample.

Step 3: Each decision tree generates an output.

Step 4: The final output is based on majority voting (for classification) or averaging
(for regression).
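A minimal sketch of these four steps with scikit-learn, which performs the bootstrap sampling, per-tree training, and majority voting internally (x_train, y_train, and x_test are assumed from the earlier split):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,    # number of decision trees in the forest
    bootstrap=True,      # Step 1: random records drawn with replacement
    random_state=0,
)
rf.fit(x_train, y_train)       # Steps 2-3: build the trees; each produces an output
y_pred = rf.predict(x_test)    # Step 4: majority vote across all trees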

Important Features of Random Forest

• Diversity: Not all attributes/variables/features are considered while making an
individual tree; each tree is different.
• Immune to the curse of dimensionality: Since each tree does not consider all
the features, the feature space is reduced.
• Parallelization: Each tree is created independently from different data and
attributes. This means we can make full use of the CPU to build random
forests.
• Train-test split: In a random forest we don’t have to segregate the data for
train and test, as about 30% of the data (the out-of-bag samples) is not seen
by any given decision tree.
• Stability: Stability arises because the result is based on majority voting/
averaging.

Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not.
But together, all the trees predict the correct output. Therefore, below are two
assumptions for a better Random forest classifier:

• There should be some actual values in the feature variable of the dataset so
that the classifier can predict accurate results rather than a guessed result.
• The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:

• It takes less training time compared to other algorithms.
• It predicts output with high accuracy; even for large datasets it runs
efficiently.
• It can also maintain accuracy when a large proportion of the data is missing.

Types of Ensembles

Before understanding the working of the random forest, we must look into the
ensemble technique. Ensemble simply means combining multiple models. Thus, a
collection of models is used to make predictions rather than an individual model.
Ensemble uses two types of methods:

Bagging: It creates different training subsets from the sample training data with
replacement, and the final output is based on majority voting; Random Forest is an
example. Bagging, also known as Bootstrap Aggregation, is the ensemble technique
used by random forest. Bagging chooses random samples from the data set, so each
model is generated from samples (bootstrap samples) drawn from the original data
with replacement, a process known as row sampling. This step of row sampling with
replacement is called bootstrapping. Each model is then trained independently and
generates a result. The final output is based on majority voting after combining the
results of all models. This step, which involves combining all the results and
generating the output based on majority voting, is known as aggregation.

Fig. 4.5: RF classifier analysis.

Boosting: It combines weak learners into strong learners by creating sequential
models such that the final model has the highest accuracy. Examples include
AdaBoost and XGBoost.

Fig. 4.6: Boosting RF classifier.

Working

The Random Forest Classifier is an ensemble learning method that combines multiple
decision trees to make predictions. Each decision tree in the forest is constructed
based on a random subset of the dataset and a random subset of features. This
randomness helps reduce overfitting and improves generalization. In this project, the
Random Forest Classifier is trained using the preprocessed MQTT dataset. If a pre-
trained model exists in a file ('randomforestmodel.pkl'), it is loaded to avoid
retraining. Otherwise, a new Random Forest model is created and trained on the
training data. After training, it is capable of making predictions. It can classify MQTT
data into different message types based on the patterns it has learned during training.
Here, predictions are made on the test dataset. The accuracy of the model, which
represents the proportion of correctly classified samples, is calculated by comparing its
predictions to the actual target values in the test dataset. Afterwards, a classification
report is generated to provide a more detailed evaluation of the model's performance.
It includes metrics such as precision, recall, F1-score, and support for each class
(message type). This report offers insights into the model's strengths and weaknesses
for different classes. Finally, the model’s performance is visualized using a confusion
matrix heatmap. This matrix helps assess the model's ability to correctly classify
instances and identify any misclassifications.

Key Features

• Ensemble Learning: Random Forest is an ensemble learning technique that
combines multiple decision trees to make predictions. This ensemble approach
enhances the model's accuracy and reduces overfitting compared to a single
decision tree.
• Robust to Overfitting: Random Forest introduces randomness by selecting
subsets of the data and features for each tree. This randomness helps prevent
overfitting, making it a robust choice for various datasets.
• Highly Parallelizable: Training each decision tree in the forest is an
independent task, making Random Forest highly parallelizable and suitable for
distributed computing.
• Feature Importance: Random Forest provides a measure of feature importance,
which allows you to identify which features have the most significant impact
on classification decisions. This is valuable for feature selection and
understanding the dataset.
• Handles Categorical Data: Random Forest can handle categorical data without
requiring one-hot encoding, simplifying the preprocessing steps.
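A short sketch of the feature-importance measure mentioned above, assuming a fitted classifier rf and that all DataFrame columns except the last hold the features (an assumption for illustration):

import pandas as pd

feature_names = df.columns[:-1]                 # assumed feature columns
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))   # ten most important features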

CHAPTER 5

UML DIAGRAMS

UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The standard
is managed, and was created by, the Object Management Group. The goal is for UML
to become a common language for creating models of object-oriented computer
software. In its current form, UML comprises two major components: a meta-model
and a notation. In the future, some form of method or process may also be added to,
or associated with, UML.

The Unified Modeling Language is a standard language for specifying, visualizing,
constructing, and documenting the artifacts of software systems, as well as for
business modeling and other non-software systems. The UML represents a collection
of best engineering practices that have proven successful in the modeling of large and
complex systems. The UML is a very important part of developing object-oriented
software and the software development process. The UML uses mostly graphical
notations to express the design of software projects.

GOALS: The primary goals in the design of the UML are as follows:

• Provide users with a ready-to-use, expressive visual modeling language so that
they can develop and exchange meaningful models.
• Provide extensibility and specialization mechanisms to extend the core
concepts.
• Be independent of particular programming languages and development
processes.
• Provide a formal basis for understanding the modeling language.
• Encourage the growth of the OO tools market.
• Support higher-level development concepts such as collaborations,
frameworks, patterns, and components.
• Integrate best practices.

5.1 Class Diagram

The class diagram is used to refine the use case diagram and define a detailed design
of the system. The class diagram classifies the actors defined in the use case diagram
into a set of interrelated classes. The relationship or association between the classes
can be either an “is-a” or “has-a” relationship. Each class in the class diagram may be
capable of providing certain functionalities. These functionalities provided by the
class are termed “methods” of the class. Apart from this, each class may have certain
“attributes” that uniquely identify the class.

5.2 Data flow diagram

A Data Flow Diagram (DFD) is a visual representation of the flow of data within a
system or process. It is a structured technique that focuses on how data moves through
different processes and data stores within an organization or a system. DFDs are
commonly used in system analysis and design to understand, document, and
communicate data flow and processing.

5.3 Sequence Diagram

A sequence diagram in Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. A sequence diagram shows, as parallel
vertical lines (“lifelines”), different processes or objects that live simultaneously, and,
as horizontal arrows, the messages exchanged between them in the order in which
they occur. This allows the specification of simple runtime scenarios in a graphical
manner.

5.4 Activity diagram

The activity diagram is another important diagram in UML, used to describe the
dynamic aspects of the system.

CHAPTER 6

SOFTWARE ENVIRONMENT

What is Python?

Below are some facts about Python.

• Python is currently the most widely used multi-purpose, high-level
programming language.
• Python allows programming in object-oriented and procedural paradigms.
Python programs are generally smaller than those in other programming
languages like Java.
• Programmers have to type relatively little, and the indentation requirements of
the language keep code readable at all times.
• The Python language is used by almost all tech-giant companies, such as
Google, Amazon, Facebook, Instagram, Dropbox, Uber, etc.

The biggest strength of Python is its huge collection of standard libraries, which can
be used for the following:

• Machine Learning
• GUI applications (like Kivy, Tkinter, PyQt, etc.)
• Web frameworks like Django (used by YouTube, Instagram, Dropbox)
• Image processing (like OpenCV, Pillow)
• Web scraping (like Scrapy, BeautifulSoup, Selenium)
• Test frameworks
• Multimedia

Advantages of Python

Let’s see how Python dominates over other languages.

1. Extensive Libraries

Python ships with an extensive library containing code for various purposes
like regular expressions, documentation generation, unit testing, web browsers,
threading, databases, CGI, email, image manipulation, and more. So, we don't have to
write the complete code for such tasks manually.

2. Extensible

As we have seen earlier, Python can be extended with other languages. You can write
some of your code in languages like C++ or C. This comes in handy, especially in
performance-critical parts of projects.

3. Embeddable

Complementary to extensibility, Python is embeddable as well. You can put your
Python code in the source code of a different language, like C++. This lets us add
scripting capabilities to our code in the other language.

4. Improved Productivity

The language’s simplicity and extensive libraries render programmers more
productive than languages like Java and C++ do. Also, you need to write less to get
more things done.

5. IOT Opportunities

Since Python forms the basis of new platforms like the Raspberry Pi, the future looks
bright for Python in the Internet of Things. This is a way to connect the language
with the real world.

6. Simple and Easy

When working with Java, you may have to create a class to print ‘Hello World’. But
in Python, just a print statement will do. It is also quite easy to learn, understand, and
code. This is why when people pick up Python, they have a hard time adjusting to
other more verbose languages like Java.

7. Readable

Because it is not such a verbose language, reading Python is much like reading
English. This is the reason why it is so easy to learn, understand, and code. It also
does not need curly braces to define blocks, and indentation is mandatory. This
further aids the readability of the code.

8. Object-Oriented

This language supports both the procedural and object-oriented programming
paradigms. While functions help us with code reusability, classes and objects let us
model the real world. A class allows the encapsulation of data and functions into one.

9. Free and Open-Source

Like we said earlier, Python is freely available. But not only can you download
Python for free, but you can also download its source code, make changes to it, and
even distribute it. It downloads with an extensive collection of libraries to help you
with your tasks.

10. Portable

When you code your project in a language like C++, you may need to make some
changes to it if you want to run it on another platform. But it isn’t the same with
Python. Here, you need to code only once, and you can run it anywhere. This is called
Write Once Run Anywhere (WORA). However, you need to be careful enough not to
include any system-dependent features.

11. Interpreted

Lastly, we will say that it is an interpreted language. Since statements are executed
one by one, debugging is easier than in compiled languages.


Advantages of Python Over Other Languages

1. Less Coding

Almost all of the tasks done in Python require less coding than when the same task is
done in other languages. Python also has awesome standard library support, so you
don’t have to search for any third-party libraries to get your job done. This is the
reason that many people suggest learning Python to beginners.

2. Affordable

Python is free, so individuals, small companies, and big organizations can
leverage the freely available resources to build applications. Python is popular and
widely used, so it gives you better community support.

The 2019 Github annual survey showed us that Python has overtaken Java in the most
popular programming language category.

3. Python is for Everyone

Python code can run on any machine whether it is Linux, Mac or Windows.
Programmers need to learn different languages for different jobs but with Python, you
can professionally build web apps, perform data analysis and machine learning,
automate things, do web scraping and also build games and powerful visualizations. It
is an all-rounder programming language.

Disadvantages of Python

So far, we’ve seen why Python is a great choice for your project. But if you choose it,
you should be aware of its consequences as well. Let’s now see the downsides of
choosing Python over another language.

1. Speed Limitations

We have seen that Python code is executed line by line. But since Python is
interpreted, it often results in slow execution. This, however, isn’t a problem unless
speed is a focal point for the project. In other words, unless high speed is a
requirement, the benefits offered by Python are enough to distract us from its speed
limitations.

2. Weak in Mobile Computing and Browsers

While it serves as an excellent server-side language, Python is much more rarely seen
on the client side. Besides that, it is rarely ever used to implement smartphone-based
applications. One such application is called Carbonnelle.

The reason it is not so famous, despite the existence of Brython, is that it isn’t that
secure.

3. Design Restrictions

As you know, Python is dynamically typed. This means that you don’t need to declare
the type of variable while writing the code. It uses duck-typing. But wait, what’s that?
Well, it just means that if it looks like a duck, it must be a duck. While this is easy on
the programmers during coding, it can raise run-time errors.

4. Underdeveloped Database Access Layers

Compared to more widely used technologies like JDBC (Java DataBase Connectivity)
and ODBC (Open DataBase Connectivity), Python’s database access layers are a bit
underdeveloped. Consequently, it is less often applied in huge enterprises.

5. Simple

No, we’re not kidding. Python’s simplicity can indeed be a problem. Take my
example. I don’t do Java, I’m more of a Python person. To me, its syntax is so simple
that the verbosity of Java code seems unnecessary.

This was all about the Advantages and Disadvantages of Python Programming
Language.

Modules Used in Project

NumPy

NumPy is a general-purpose array-processing package. It provides a high-
performance multidimensional array object and tools for working with these arrays.
It is the fundamental package for scientific computing with Python. It contains
various features, including these important ones:

• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data. Arbitrary datatypes can be defined, which
allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
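A tiny illustration of the array object, broadcasting, and the numerical utilities listed above:

import numpy as np

a = np.arange(6).reshape(2, 3)   # a 2x3 N-dimensional array object
b = np.array([10, 20, 30])
print(a + b)                     # broadcasting: b is applied to each row of a
print(np.fft.fft(b))             # Fourier transform utility
print(np.random.rand(2))         # random number capability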

Pandas

Pandas is an open-source Python library providing high-performance data
manipulation and analysis tools built on its powerful data structures. Python was
previously used mostly for data munging and preparation and contributed very little
to data analysis; Pandas solved this problem. Using Pandas, we can accomplish five
typical steps in the processing and analysis of data, regardless of the origin of the
data: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a
wide range of fields, including academic and commercial domains such as finance,
economics, statistics, and analytics.
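A small example of the load/prepare/analyze workflow described above (the values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'protocol': ['mqtt', 'mqtt', 'http'],   # toy data
                   'bytes': [120, 95, 430]})
print(df.describe())                                       # quick numeric summary
print(df.groupby('protocol')['bytes'].mean())              # a simple analysis step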

Matplotlib

Matplotlib is a Python 2D plotting library which produces publication-quality figures
in a variety of hardcopy formats and interactive environments across platforms.
Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter
Notebook, web application servers, and four graphical user interface toolkits.
Matplotlib tries to make easy things easy and hard things possible. You can generate
plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a
few lines of code.

For simple plotting, the pyplot module provides a MATLAB-like interface,
particularly when combined with IPython. For the power user, you have full control of
line styles, font properties, axes properties, etc., via an object-oriented interface or via
a set of functions familiar to MATLAB users.
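A minimal pyplot example of the MATLAB-like interface described above:

import matplotlib.pyplot as plt

xs = [1, 2, 3, 4]
plt.plot(xs, [x ** 2 for x in xs], marker='o')   # a simple line plot
plt.xlabel('x')
plt.ylabel('x squared')
plt.title('A plot in a few lines of code')
plt.show()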

Scikit-learn

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a
consistent interface in Python. It is licensed under a permissive simplified BSD
license and is distributed by many Linux distributions, encouraging academic and
commercial use.
Install Python Step-by-Step in Windows and Mac

Python, a versatile programming language, doesn’t come pre-installed on your
computer. Python was first released in the year 1991, and to this day it remains a
very popular high-level programming language. Its design philosophy emphasizes
code readability with its notable use of significant whitespace.

The object-oriented approach and language constructs provided by Python enable
programmers to write both clear and logical code for projects. This software does not
come pre-packaged with Windows.

How to Install Python on Windows and Mac

There have been several updates to the Python version over the years. The question is:
how do you install Python? It might be confusing for a beginner who is willing to start
learning Python, but this tutorial will solve your query. At the time of writing, the
latest version of Python is 3.7.4, in other words, Python 3.

Note: Python version 3.7.4 cannot be used on Windows XP or earlier devices.

Before you start with the installation process of Python, you first need to know your
system requirements. You must download the Python version based on your system
type, i.e., operating system and processor. My system type is a Windows 64-bit
operating system, so the steps below are to install Python version 3.7.4 (Python 3)
on a Windows 7 device. The steps on how to install Python on Windows 10, 8, and 7
are divided into 4 parts to help you understand better.

Download the Correct version into the system

Step 1: Go to the official site to download and install python using Google Chrome or
any other web browser. OR Click on the following link: https://www.python.org

Now, check for the latest and the correct version for your operating system.

Step 2: Click on the Download Tab.

Step 3: You can either select the Download Python 3.7.4 for Windows button in
yellow, or you can scroll further down and click on the download link for your
respective version. Here, we are downloading the most recent Python version for
Windows, 3.7.4.

Step 4: Scroll down the page until you find the Files option.

Step 5: Here you see a different version of python along with the operating system.

• To download Windows 32-bit python, you can select any one from the three
options: Windows x86 embeddable zip file, Windows x86 executable installer
or Windows x86 web-based installer.
• To download Windows 64-bit python, you can select any one from the three
options: Windows x86-64 embeddable zip file, Windows x86-64 executable
installer or Windows x86-64 web-based installer.

Here we will install the Windows x86-64 web-based installer. With this, the first part,
choosing which version of Python to download, is completed. Now we move
ahead with the second part of installing Python, i.e., installation.

Note: To know the changes or updates that are made in the version you can click on
the Release Note Option.

Installation of Python

Step 1: Go to Download and Open the downloaded python version to carry out the
installation process.

Step 2: Before you click on Install Now, make sure to put a tick on Add Python 3.7 to
PATH.

Step 3: Click on Install Now. After the installation is successful, click on Close.

With the above three steps of Python installation, you have successfully and
correctly installed Python. Now it is time to verify the installation.

Note: The installation process might take a couple of minutes.

Verify the Python Installation

Step 1: Click on Start

Step 2: In the Windows Run Command, type “cmd”.

Step 3: Open the Command prompt option.

Step 4: Let us test whether Python is correctly installed. Type python -V and press
Enter.

Step 5: You will get the answer as Python 3.7.4.

Note: If you have an earlier version of Python already installed, you must first
uninstall it and then install the new one.

Check how the Python IDLE works

Step 1: Click on Start

Step 2: In the Windows Run command, type “python idle”.

Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program

Step 4: To go ahead with working in IDLE you must first save the file. Click on File >
Click on Save

Step 5: Name the file, and the save-as type should be Python files. Click on SAVE.
Here the file is named Hey World.

Step 6: Now, for example, enter print(“Hey World”) and press Enter.

You will see that the given command is executed. With this, we end our tutorial on
how to install Python. You have learned how to download and install Python for
Windows on your operating system.

Note: Unlike Java, Python does not need semicolons at the end of statements.

CHAPTER 7

SYSTEM REQUIREMENTS

7.1 Software Requirements

The functional requirements or overall description documents include the product
perspective and features, operating system and operating environment, graphics
requirements, design constraints, and user documentation.

The appropriation of requirements and implementation constraints gives a general
overview of the project with regard to what the areas of strength and deficit are and
how to tackle them.

• Python IDLE 3.7 version, or
• Anaconda 3.7, or
• Jupyter, or
• Google Colab

7.2 Hardware Requirements

Minimum hardware requirements are very dependent on the particular software being
developed by a given Enthought Python / Canopy / VS Code user. Applications that
need to store large arrays/objects in memory will require more RAM, whereas
applications that need to perform numerous calculations or tasks more quickly will
require a faster processor.

• Operating system : Windows, Linux
• Processor : minimum Intel i3
• RAM : minimum 4 GB
• Hard disk : minimum 250 GB

CHAPTER 8

FUNCTIONAL REQUIREMENTS

Output Design

Outputs from computer systems are required primarily to communicate the results of
processing to users. They are also used to provide a permanent copy of the results for
later consultation. The various types of outputs in general are:

• External outputs, whose destination is outside the organization.
• Internal outputs, whose destination is within the organization; they are the
user’s main interface with the computer.
• Operational outputs, whose use is purely within the computer department.
• Interface outputs, which involve the user in communicating directly.

Output Definition

The outputs should be defined in terms of the following points:

• Type of the output
• Content of the output
• Format of the output
• Location of the output
• Frequency of the output
• Volume of the output
• Sequence of the output

It is not always desirable to print or display data exactly as it is held on a computer. It
should be decided which form of output is the most suitable.

Input Design

Input design is a part of overall system design. The main objective during the input
design is as given below:

• To produce a cost-effective method of input.
• To achieve the highest possible level of accuracy.

• To ensure that the input is acceptable and understood by the user.

Input Stages

The main input stages can be listed as below:

• Data recording
• Data transcription
• Data conversion
• Data verification
• Data control
• Data transmission
• Data validation
• Data correction

Input Types

It is necessary to determine the various types of inputs. Inputs can be categorized as
follows:

• External inputs, which are prime inputs for the system.
• Internal inputs, which are user communications with the system.
• Operational inputs, which are the computer department’s communications to
the system.
• Interactive inputs, which are entered during a dialogue.

Input Media

At this stage a choice has to be made about the input media. To decide on the input
media, consideration has to be given to:

• Type of input
• Flexibility of format
• Speed
• Accuracy
• Verification methods
• Rejection rates
• Ease of correction
• Storage and handling requirements

• Security
• Easy to use
• Portability

Keeping in view the above description of the input types and input media, it can be
said that most of the inputs are of the internal and interactive forms. As the input data
is to be directly keyed in by the user, the keyboard can be considered to be the most
suitable input device.

Error Avoidance

At this stage, care is taken to ensure that the input data remains accurate from the
stage at which it is recorded up to the stage at which it is accepted by the system.
This can be achieved only by means of careful control each time the data is handled.

Error Detection

Even though every effort is made to avoid the occurrence of errors, a small
proportion of errors is always likely to occur. These errors can be discovered by
using validations to check the input data.

Data Validation

Procedures are designed to detect errors in data at a lower level of detail. Data
validations have been included in the system in almost every area where there is a
possibility for the user to commit errors. The system will not accept invalid data.
Whenever invalid data is keyed in, the system immediately prompts the user to
re-enter it, and the data is accepted only if it is correct. Validations have been
included wherever necessary.

The system is designed to be user-friendly; in other words, it has been designed to
communicate effectively with the user, using pop-up menus.
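
As an illustration only (a sketch, not part of the project code), such
prompt-until-valid behaviour could be written in Python as follows; the field name
and validation rule here are hypothetical:

def read_positive_int(prompt="Enter keep-alive interval (seconds): "):
    # Keep prompting until the user supplies a valid positive integer.
    while True:
        raw = input(prompt)
        try:
            value = int(raw)
            if value > 0:
                return value
        except ValueError:
            pass
        print("Invalid input, please enter a positive whole number.")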

User Interface Design

It is essential to consult the system users and discuss their needs while designing
the user interface.

User Interface Systems Can Be Broadly Classified As:

• User-initiated interfaces, in which the user is in charge, controlling the
progress of the user/computer dialogue.
• Computer-initiated interfaces, in which the computer selects the next stage in
the interaction.

In computer-initiated interfaces, the computer guides the progress of the
user/computer dialogue: information is displayed, and on the user's response the
computer takes action or displays further information.

User Initiated Interfaces

User-initiated interfaces fall into two broad classes:

• Command-driven interfaces: the user inputs commands or queries which are
interpreted by the computer.
• Forms-oriented interfaces: the user calls up an image of a form on the screen and
fills it in. The forms-oriented interface was chosen here because it suits
structured data entry best.

Computer-Initiated Interfaces

The following computer-initiated interfaces were used:

• The menu system, in which the user is presented with a list of alternatives and
chooses one of them.
• The question-answer dialogue system, in which the computer asks a question and
takes action on the basis of the user's reply.

Right from the start the system is menu-driven: the opening menu displays the
available options, and choosing one option gives another pop-up menu with more
options. In this way every option leads the user to a data-entry form where the data
can be keyed in.

Error Message Design

The design of error messages is an important part of user interface design. As the
user is bound to commit some error or other while using the system, the system
should be designed to be helpful by providing information regarding the error he or
she has committed.

This application must be able to produce output at different modules for different
inputs.

Performance Requirements

Performance is measured in terms of the output provided by the application.


Requirement specification plays an important part in the analysis of a system. Only
when the requirement specifications are properly given is it possible to design a
system that will fit into the required environment. It rests largely with the users
of the existing system to give the requirement specifications, because they are the
people who will finally use the system. The requirements have to be known during the
initial stages so that the system can be designed according to them; it is very
difficult to change the system once it has been designed, while a system that does
not cater to the requirements of the user is of no use.

The requirement specification for any system can be broadly stated as given below:

• The system should be able to interface with the existing system.
• The system should be accurate.
• The system should be better than the existing system.
• The existing system is completely dependent on the user to perform all the
duties.

CHAPTER 9

SOURCE CODE

# Machine learning model for Message Queuing Telemetry Transport Data Analytics

import pandas as pd

import numpy as np

import os

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SelectKBest, f_classif

from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

from sklearn.neighbors import KNeighborsClassifier

from sklearn.ensemble import RandomForestClassifier

import joblib

dataset = pd.read_csv("mqttdataset_reduced.csv")

dataset

# Replace 0x00000000 with 0 in the DataFrame. (Note: the hex literal 0x00000000
# is simply the integer 0; if the CSV stores the value as the text '0x00000000',
# the string form would be needed instead.)
dataset.replace(0x00000000, 0, inplace=True)

dataset['target'].unique()

# Class names in alphabetical order, which is the order in which LabelEncoder
# assigns the integer codes (0 = 'bruteforce', ..., 5 = 'slowite')
labels = ('bruteforce', 'dos', 'flood', 'legitimate', 'malformed', 'slowite')

dataset.isnull().sum()

dataset.info()

dataset.columns

# Create the count plot
sns.set(style="darkgrid")
plt.figure(figsize=(8, 6))
sns.countplot(data=dataset, x='target', palette="Set2")

# Add count labels on top of each bar
ax = plt.gca()
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom')

# Set axis labels and plot title (customize as needed)
plt.xlabel("Categories")
plt.ylabel("Count")
plt.title("Count Plot")
plt.show()

# Split the frame into features (all columns but the last) and target (last column)
X = dataset.iloc[:, :-1]
X.head()

y = dataset.iloc[:, -1]
y

# Label encoding: a separate encoder is kept for the target so that predictions
# can later be mapped back to class names
le_target = LabelEncoder()
y = le_target.fit_transform(y)

# Encode the categorical feature columns
le = LabelEncoder()
X['tcp.flags'] = le.fit_transform(X['tcp.flags'])
X['mqtt.conflags'] = le.fit_transform(X['mqtt.conflags'])
X['mqtt.hdrflags'] = le.fit_transform(X['mqtt.hdrflags'])
X['mqtt.msg'] = le.fit_transform(X['mqtt.msg'])
X['mqtt.protoname'] = le.fit_transform(X['mqtt.protoname'])

# Split the data into training and testing sets (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
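
# Optional illustration: SelectKBest is imported above but not used in the
# original project code. A minimal sketch of univariate feature selection on the
# training split is shown here; k=10 is an arbitrary illustrative choice, not a
# tuned value, and the selected matrices are not used by the models below.
selector = SelectKBest(score_func=f_classif, k=10)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
print("Selected features:", list(X.columns[selector.get_support()]))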

# Random forest

# Define the model filename
filename = 'randomforestmodel.pkl'

# Check if the model file exists
if os.path.exists(filename):
    # Load the model from the file
    rf = joblib.load(filename)
else:
    rf = RandomForestClassifier()
    # Training the model on the training data
    rf.fit(X_train, y_train)
    # Save the trained model for later reuse
    joblib.dump(rf, filename)

# Predicting the target variable on the test data
y_pred = rf.predict(X_test)

# Evaluating the model's performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Printing a classification report for more detailed evaluation
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Create a figure and axis for the plot
fig, ax = plt.subplots()

# Create a heatmap using seaborn
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels,
            yticklabels=labels, ax=ax)

# Set labels and title
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix of RF Model')

# Show the plot
plt.show()

# KNN

# Define the model filename
model_filename = 'model.pkl'

# Check if the model file exists
if os.path.exists(model_filename):
    # Load the model from the file
    knn = joblib.load(model_filename)
else:
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    joblib.dump(knn, model_filename)

# Predicting the target variable on the test data
y_pred1 = knn.predict(X_test)

# Evaluating the model's performance
accuracy = accuracy_score(y_test, y_pred1)
print(f"Accuracy: {accuracy}")

# Printing a classification report for more detailed evaluation
print("Classification Report:\n", classification_report(y_test, y_pred1))

cm = confusion_matrix(y_test, y_pred1)

# Create a figure and axis for the plot
fig, ax = plt.subplots()

# Create a heatmap using seaborn
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels,
            yticklabels=labels, ax=ax)

# Set labels and title
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix of KNN Model')

# Show the plot
plt.show()

# Prediction on test data

# LabelEncoder assigns integer codes in alphabetical order of the class names,
# so the codes map to: 0 = 'bruteforce', 1 = 'dos', 2 = 'flood',
# 3 = 'legitimate', 4 = 'malformed', 5 = 'slowite'
predict = knn.predict(X_test)

for i in range(len(predict)):
    predicted_label = labels[predict[i]]
    print("{} :***********************************{}".format(X_test.iloc[i, :],
                                                              predicted_label))
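
# A more compact alternative (sketch): decode all predictions at once using the
# target encoder fitted earlier, then summarise how many test packets fall into
# each predicted class.
decoded = le_target.inverse_transform(predict)
print(pd.Series(decoded).value_counts())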

CHAPTER 10

RESULTS AND DISCUSSION

10.1 Implementation description

This project builds and evaluates machine learning models for MQTT data analytics;
MQTT is a lightweight messaging protocol often used in IoT applications. The
implementation loads MQTT data, preprocesses it, builds and evaluates two machine
learning models (Random Forest and KNN) for classification, and provides insights
into model performance through visualizations and predictions on test data.

1. Import Libraries: Import various Python libraries, including Pandas, NumPy,
Matplotlib, Seaborn, and scikit-learn (a machine learning library).

2. Load Dataset: Read a CSV file named "mqttdataset_reduced.csv" into a Pandas
DataFrame called dataset.

3. Data Preprocessing

• Replace the value 0x00000000 with 0 throughout the DataFrame.
• Encode the categorical target variable 'target' using Label Encoding.
• Encode several categorical features in the 'X' DataFrame, including 'tcp.flags',
'mqtt.conflags', 'mqtt.hdrflags', 'mqtt.msg', and 'mqtt.protoname'.

4. Data Splitting: Split the data into training and testing sets using scikit-learn's
train_test_split function. The training set contains 80% of the data, while the testing
set contains 20%.

5. Random Forest Classifier

• Check if a pre-trained Random Forest classifier model exists in a file named
'randomforestmodel.pkl'. If it exists, load the model; otherwise, train a new
Random Forest classifier on the training data and save it to the file.
• Use the trained Random Forest model to make predictions on the test data.
• Calculate and print the accuracy of the Random Forest model.
• Print a classification report with detailed evaluation metrics.
• Create and display a confusion matrix heatmap.

6. KNN Classifier

• Check if a pre-trained KNN classifier model exists in a file named 'model.pkl'.
If it exists, load the model; otherwise, train a new KNN classifier on the
training data and save it to the file.
• Use the trained KNN model to make predictions on the test data.
• Calculate and print the accuracy of the KNN model.
• Print a classification report with detailed evaluation metrics.
• Create and display a confusion matrix heatmap.

7. Prediction on Test Data

• For each prediction in the test data:
• Determine the predicted class label (e.g., 'legitimate', 'dos', 'slowite', etc.).
• Print the features of the test data point along with the predicted class label.

8. Data Visualization: Create a count plot of the 'target' variable to visualize the
distribution of different MQTT message types.

10.2 Dataset description

The dataset is related to network traffic data, specifically focused on the TCP
(Transmission Control Protocol) and MQTT protocols. Here is a description of the
fields in the dataset:

tcp.flags: This refers to the flags used in TCP packets. TCP flags are control bits
within the TCP header that indicate various conditions and control actions in the TCP
connection.

tcp.time_delta: This represents the time difference between consecutive TCP packets,
indicating the time duration between their transmissions.

tcp.len: This refers to the length of the TCP packet in bytes.

mqtt.conack.flags: These relate to the flags used in MQTT connection acknowledgment
packets.

mqtt.conack.flags.reserved, mqtt.conack.flags.sp, mqtt.conack.val: These fields are
associated with different aspects of MQTT connection acknowledgment flags and
values.
mqtt.conflag.cleansess, mqtt.conflag.passwd, mqtt.conflag.qos, mqtt.conflag.reserved,
mqtt.conflag.retain, mqtt.conflag.uname, mqtt.conflag.willflag: These fields pertain to
various connection flags in MQTT packets, indicating aspects of the MQTT
connection setup.

mqtt.conflags: This is a combination of MQTT connection flags.

mqtt.dupflag: This feature indicates whether the MQTT packet is a duplicate.

mqtt.hdrflags: These flags are related to the MQTT packet header.

mqtt.kalive: This represents the keep-alive interval for the MQTT connection.

mqtt.len: The length of the MQTT packet.

mqtt.msg: This contains the content of the MQTT message.

mqtt.msgid: The identifier associated with the MQTT message.

mqtt.msgtype: The type of MQTT message (e.g., publish, subscribe, etc.).

mqtt.proto_len, mqtt.protoname: These fields give the MQTT protocol name and its
length.

mqtt.qos: The quality-of-service level associated with the MQTT message.

mqtt.retain: Whether the MQTT message should be retained by the broker.

mqtt.sub.qos: The quality-of-service level for MQTT subscription messages.

mqtt.ver: The version of MQTT being used.

mqtt.willmsg, mqtt.willmsg_len: These fields are related to the MQTT "will" message
content and its length.

mqtt.willtopic, mqtt.willtopic_len: These fields pertain to the MQTT "will"
message's topic and its length.

target: This field is the target or label associated with each data entry, which is used
for classification purposes.
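
As a quick way to relate these field descriptions to the actual data, the columns
can be inspected with pandas. This is an illustrative sketch only, assuming the same
"mqttdataset_reduced.csv" file used in Chapter 9:

import pandas as pd

df = pd.read_csv("mqttdataset_reduced.csv")
print(df.dtypes)                     # data type of every field listed above
print(df['mqtt.msgtype'].unique())   # the MQTT message types present
print(df['target'].value_counts())   # how many rows fall in each target class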

Target Variables

Legitimate: This category refers to network traffic that is considered normal and
acceptable. It includes genuine connections, messages, or activities that are in line
with the expected behavior of the network. Legitimate traffic is typically free from
malicious intent or anomalies.

DoS (Denial of Service): Denial of Service is a type of attack in which a malicious
actor floods a network or system with excessive traffic, causing it to become
overwhelmed and unavailable to legitimate users. This disrupts the normal operation
of the targeted network or system.

Slowite: This is a type of network behavior that causes a slowdown or degradation in
the performance of the network. It can occur due to various factors such as
inefficient communication, bottlenecks, or other issues.

Malformed: "Malformed" refers to network packets or messages that do not adhere to


the expected protocol specifications. These are the messages with incorrect
formatting, missing information, or data that doesn't make sense in the context of the
protocol. Malformed packets can sometimes indicate attempts to exploit
vulnerabilities or disrupt communication.

Brute Force: Brute force attacks involve systematically trying all possible
combinations of passwords or keys until the correct one is found. In the context of
network security, a brute force attack might be an attempt to gain unauthorized access
to a system by repeatedly trying different credentials.

Flood: A "flood" typically refers to a large and rapid influx of data or requests. In the
context of network traffic, a flood attack involves overwhelming a system or network
with a massive volume of traffic. This can lead to congestion and potentially disrupt
the normal functioning of the targeted network.

10.3 Results and description

Figure 10.1 represents a graphical representation of the MQTT dataset used in the
project. It shows a portion of the dataset with rows and columns, where each row
represents a data point (e.g., an MQTT message) and the columns represent different
features (e.g., message attributes). Figure 10.2 is a graphical representation of a
count plot. It displays the distribution of the different target categories or
message types in the MQTT dataset. Each bar in the plot corresponds to a specific
message type, and the height of the bar represents the count or frequency of that
message type in the dataset. This visualization helps us understand the class
distribution, which can be crucial for assessing potential class imbalances.

Figure 10.3 is a visual representation of the classification report and accuracy
metrics obtained from the RF classification model. The classification report
typically includes metrics such as precision, recall, F1-score, and support for each
class (message type). Accuracy represents the overall correct classification rate.

Figure 10.1: Sample dataset used for MQTT data analytics.

Figure 10.2: Count plot of target categories.

Figure 10.3: Classification report and accuracy obtained using RF classification.

Figure 10.4 is an illustration of the confusion matrix obtained for the Random
Forest classification model. A confusion matrix is a table that displays the model's
classification results, showing the number of true positives, true negatives, false
positives, and false negatives for each class (message type). It is often visualized
as a heatmap to provide a clear overview of the model's performance in terms of
correct and incorrect classifications.
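
To make the connection between the heatmap and the reported metrics concrete, the
per-class precision and recall can be derived directly from a confusion matrix. The
sketch below uses a small, made-up 3x3 matrix purely for illustration; for the
actual models, the cm computed in Chapter 9 would be used instead:

import numpy as np

cm = np.array([[50,  2,  1],   # rows: actual class, columns: predicted class
               [ 3, 40,  5],
               [ 0,  4, 45]])

tp = np.diag(cm)                  # true positives per class
precision = tp / cm.sum(axis=0)   # TP / (TP + FP), column-wise totals
recall = tp / cm.sum(axis=1)      # TP / (TP + FN), row-wise totals
print("Precision per class:", np.round(precision, 3))
print("Recall per class:   ", np.round(recall, 3))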

Figure 10.5 provides visual representations of the accuracy and classification
report metrics obtained from the KNN classification model. Like Figure 10.3, it
displays precision, recall, F1-score, support, and overall accuracy for each message
type class, offering a summary of how well the KNN model performs in classifying
MQTT data. Figure 10.6 shows the corresponding confusion matrix for the KNN model.

Figure 10.4: Confusion matrix obtained using RF classification on MQTT data.

Figure 10.5: Accuracy and classification report obtained using KNN classification.

Figure 10.6: Confusion matrix obtained using KNN classification.

CHAPTER 11

CONCLUSION AND FUTURE SCOPE

In the context of MQTT data analytics, this project has successfully accomplished its
primary objective of developing and evaluating machine learning models for
classifying MQTT messages into distinct categories or message types. The Random
Forest Classifier and KNN Classifier have been effectively implemented and assessed
for their classification capabilities. Through accuracy measurements and the
generation of comprehensive classification reports, both models have demonstrated
their competence in categorizing MQTT data with accuracy and precision. One
notable feature of the project is the visualization of model performance using
confusion matrices. These matrices provide a clear representation of the models'
effectiveness by depicting true positives, true negatives, false positives, and false
negatives. Such visual insights are invaluable for understanding the models' strengths
and weaknesses.

To ensure data readiness for machine learning, the project engaged in critical data
preprocessing tasks, including data cleansing and label encoding. This preprocessing
step plays a pivotal role in ensuring that the data is well-suited for analysis and
classification. Moreover, the project encompasses mechanisms for persisting trained
models. By saving and loading pre-trained models, the efficiency of handling new
MQTT data is greatly improved. This obviates the need for retraining models from
scratch, streamlining the process of classifying incoming data. The project also
devoted attention to data exploration through count plots, which provide a visual
representation of the distribution of MQTT message types. This exploration step aids
in comprehending the dataset's class distribution, which can be crucial for subsequent
decision-making and analysis.

Future Scope

There are several avenues for enhancing this MQTT data analytics project. Firstly,
the models' hyperparameters can be optimized to potentially boost classification
performance. Techniques like grid search or random search can be employed to
systematically identify the most effective hyperparameter configurations, as
sketched below.
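
A minimal sketch of such a search for the Random Forest model, assuming the X_train
and y_train splits from Chapter 9 are available; the parameter grid shown is
illustrative, not a tuned recommendation:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],      # number of trees in the forest
    'max_depth': [None, 10, 20],     # maximum depth of each tree
    'min_samples_split': [2, 5],     # minimum samples needed to split a node
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)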

Additionally, model ensemble strategies, such as stacking or bagging, can be
investigated to combine the strengths of multiple models, potentially leading to
improved classification accuracy. Further, anomaly detection techniques could be
integrated to identify unusual or unexpected MQTT messages, which can be
indicative of security breaches or system irregularities.
