
UNIT-III

IIOT ANALYTICS
Big Data Analytics and Software Defined Networks, Machine Learning and Data Science, Julia
Programming, Data Management with Hadoop.

Big Data Analytics

At first, IoT data is just a curiosity, and it can even be useful if handled correctly. However, given time, as more and more devices are added to IoT networks, the data generated by these systems becomes overwhelming.

The real value of IoT is not just in connecting things but rather in the data produced by those things, the new services you can enable via those connected things, and the business insights that the data can reveal.

However, to be useful, the data needs to be handled in a way that is organized and controlled. Thus, a new approach to data analytics is needed for the Internet of Things.
Introduction to Data Analytics for IoT

In the world of IoT, the creation of massive amounts of data from sensors is
common and one of the biggest challenges—not only from a transport
perspective but also from a data management standpoint.

Analysing large amount of data in the most efficient manner possible falls
under the umbrella of data analytics.

Data analytics must be able to offer actionable insights and knowledge from
data, no matter the amount or style, in a timely manner, or the full benefits of
IoT cannot be realized.

Example:

Modern jet engines may be equipped with around 5,000 sensors and can generate a whopping 10 GB of data per second. Therefore, a twin-engine commercial aircraft with these engines operating on average 8 hours a day will generate over 500 TB of data daily, and this is just the data from the engines! Aircraft today have thousands of other sensors connected to the airframe and other systems.

In fact, a single wing of a modern jumbo jet is equipped with 10,000 sensors.

The potential for a petabyte (PB) of data per day per commercial airplane is not far-fetched—and this is just for one airplane. Across the world, there are approximately 100,000 commercial flights per day. The amount of IoT data coming just from the commercial airline business is overwhelming.
IIoT Analytics: Data Science

• Big Data Analytics
• Volume, velocity, variability, veracity, variety
• Industrial automation, system health monitoring, predictive maintenance, remote monitoring
• Artificial Intelligence
• Deep Learning (DL)
• Machine Learning (ML)

Instead of physics-based models, ML and DL enable a data-driven system modelling approach.

Key concepts related to data


➢ Not all data is the same; it can be categorized and thus analysed in different ways.

➢ Depending on how data is categorized, various data analytics tools and processing methods can be applied.

➢ Two important categorizations from an IoT perspective are whether the data is structured or unstructured and whether it is in motion or at rest.

Structured versus Unstructured Data

Structured data and unstructured data are important classifications, as they typically require different toolsets from a data analytics perspective.

Structured data means that the data follows a model or schema that defines
how the data is represented or organized, meaning it fits well with a traditional
relational database management system (RDBMS).

Simply put, structured data typically refers to highly organized, stored information that is efficiently and easily searchable.

IoT sensor data often uses structured values, such as temperature, pressure,
humidity, and so on, which are all sent in a known format. Structured data is
easily formatted, stored, queried, and processed; for these reasons, it has been
the core type of data used for making business decisions.

Because of the highly organized format of structured data, a wide array of data analytics tools are readily available for processing this type of data.

From custom scripts to commercial software like Microsoft Excel and Tableau,
most people are familiar and comfortable with working with structured data.

Unstructured data lacks a logical schema for understanding and decoding the
data through traditional programming means.

Examples of this data type include text, speech, images, and video. As a general rule, any data that does not fit neatly into a predefined data model is classified as unstructured data.

According to some estimates, around 80% of a business’s data is unstructured. Because of this fact, data analytics methods that can be applied to unstructured data, such as cognitive computing and machine learning, are deservedly garnering a lot of attention.

With machine learning applications, such as natural language processing (NLP), you can decode speech. With image/facial recognition applications, you can extract critical information from still images and video. The handling of unstructured IoT data employing machine learning techniques is covered in more depth later in this unit.

Semi-structured data is sometimes included along with structured and unstructured data. As you can probably guess, semi-structured data is a hybrid of structured and unstructured data and shares characteristics of both. While not relational, semi-structured data contains a certain schema and consistency. Email is a good example of semi-structured data, as the fields are well defined but the content contained in the body field and attachments is unstructured.

Smart objects in IoT networks generate both structured and unstructured data. Structured data is more easily managed and processed due to its well-defined organization.

On the other hand, unstructured data can be harder to deal with and typically
requires very different analytics tools for processing the data.

Data in Motion versus Data at Rest

As in most networks, data in IoT networks is either in transit (“data in motion”) or being held or stored (“data at rest”).

Examples of data in motion include traditional client/server exchanges, such as web browsing, file transfers, and email.

Data saved to a hard drive, storage array, or USB drive is data at rest.

➢ From an IoT perspective, the data from smart objects is considered data
in motion as it passes through the network en route to its final destination.

➢ This is often processed at the edge, using fog computing. When data is
processed at the edge, it may be filtered and deleted or forwarded on for further
processing and possible storage at a fog node or in the data center.

➢ Data does not come to rest at the edge.

➢ When data arrives at the data center, it is possible to process it in real time, just like at the edge, while it is still in motion.

➢ Tools with this sort of capability, such as Spark, Storm, and Flink, are
relatively nascent compared to the tools for analysing stored data.

Data at rest in IoT networks can be typically found in IoT brokers or in some
sort of storage array at the data center. Myriad tools, especially tools for
structured data in relational databases, are available from a data analytics
perspective.

The best known of these tools is Hadoop. Hadoop not only helps with data processing but also with data storage.

IoT Data Analytics Overview

The true importance of IoT data from smart objects is realized only when the
analysis of the data leads to actionable business intelligence and insights.

Data analysis is typically broken down by the types of results that are
produced.

Descriptive: Descriptive data analysis tells you what is happening, either now or
in the past.

Diagnostic: When you are interested in the “why,” diagnostic data analysis can
provide the answer.

Predictive: Predictive analysis aims to foretell problems or issues before they occur.

Prescriptive: Prescriptive analysis goes a step beyond predictive and recommends solutions for upcoming problems.

Both predictive and prescriptive analyses are more resource intensive and
increase complexity, but the value they provide is much greater than the value
from descriptive and diagnostic analysis.

The four data analysis types can be ranked by increasing complexity and value. Descriptive analysis is the least complex and at the same time offers the least value. On the other end, prescriptive analysis provides the most value but is the most complex to implement.

Most data analysis in the IoT space relies on descriptive and diagnostic
analysis, but a shift toward predictive and prescriptive analysis is
understandably occurring for most businesses and organizations.

IoT Data Analytics Challenges

IoT data places two specific challenges on a relational database:

Scaling problems: Due to the large number of smart objects in most IoT
networks that continually send data, relational databases can grow incredibly
large very quickly. This can result in performance issues that can be costly to
resolve, often requiring more hardware and architecture changes.

Volatility of data: With relational databases, it is critical that the schema be designed correctly from the beginning. Changing it later can slow or stop the database from operating. Due to this lack of flexibility, revisions to the schema must be kept to a minimum. IoT data, however, is volatile in the sense that the data model is likely to change and evolve over time.
Some other challenges:

• IoT also brings challenges with the live streaming nature of its data and
with managing data at the network level. Streaming data, which is generated as
smart objects transmit data, is challenging because it is usually of a very high
volume, and it is valuable only if it is possible to analyse and respond to it in
real-time.

• Real-time analysis of streaming data allows you to detect patterns or anomalies that could indicate a problem or a situation that needs some kind of immediate response. To have a chance of affecting the outcome of this problem, you naturally must be able to filter and analyse the data while it is occurring, as close to the edge as possible.

• The market for analysing streaming data in real time is growing fast. Major cloud analytics providers, such as Google, Microsoft, and IBM, have streaming analytics offerings, and various other applications can be used in-house.

• Another challenge that IoT brings to analytics is in the area of network data, which is referred to as network analytics. With the large numbers of smart objects in IoT networks that are communicating and streaming data, it can be challenging to ensure that these data flows are effectively managed, monitored, and secured. Network analytics tools such as Flexible NetFlow and IPFIX provide the capability to detect irregular patterns or other problems in the flow of IoT data through a network.

Software Defined Networking in IoT

Software-defined networking (SDN) in the Internet of Things (IoT) provides an architecture that enhances the adaptability and flexibility of networks. By abstracting multiple network layers, SDN changes how networks are controlled, allowing enterprises and service providers to adapt swiftly to evolving business demands. The approach aims to optimize network management and give organizations the agility needed in an ever-changing digital landscape.

SDN's ability to provide abstractions allows network administrators to exert holistic control over the network using high-level policies, without having to concern themselves with the intricacies of low-level configurations. Consequently, leveraging SDN proves advantageous in addressing the heterogeneous nature of IoT and catering to its unique application-specific demands.
Types of Software Defined Networking

Open SDN: Open protocols are used to orchestrate and govern both virtual and physical devices and to direct the flow of data packets.

API SDN: Programming interfaces, known as southbound APIs, regulate the exchange of data between devices and ensure efficient data flow management.

Overlay Model SDN: A virtual network layer is constructed above the existing hardware infrastructure, encompassing data tunnels and channels to data centers. This model allocates bandwidth within each channel and assigns devices to their designated channels.

Hybrid Model SDN: SDN and traditional networking are blended, enabling the optimal protocol to be selected for each traffic type. Hybrid SDN is often used as a phased implementation strategy for a smooth transition to SDN.

Significance of Software Defined Networking in IoT

Software-defined networking (SDN) in the Internet of Things (IoT) signifies a considerable improvement over traditional networking, delivering a range of essential benefits:

Enhanced Control with Speed and Flexibility: SDN eliminates the need for manual configuration of various hardware devices from different vendors. Instead, developers can exert control over network traffic by programming a software-based controller that adheres to open standards. This approach gives networking managers the freedom to select networking equipment and to communicate with multiple hardware devices using a single protocol via a centralized controller, resulting in remarkable speed and flexibility.

Customizable Network Infrastructure: With SDN, administrators can centrally design network services and swiftly allocate virtual resources to modify the network infrastructure. This capability allows network administrators to prioritize applications that demand increased availability and optimize the flow of data across the network according to specific requirements.

Robust Security: SDN in IoT offers comprehensive visibility across the entire
network, presenting a holistic view of potential security threats. As the number
of intelligent devices connecting to the Internet continues to proliferate, SDN
surpasses traditional networking in terms of security advantages. Operators
can create distinct zones for devices requiring different security levels or
promptly isolate compromised devices to prevent the spread of infections
throughout the network.

By embracing software-defined networking in IoT, organizations can unlock the potential for greater control, customization, and security within their networks, paving the way for optimized performance and improved management of IoT deployments.

Risks of Software Defined Networking in IoT

From bolstering agility and control to streamlining management and configuration, SDN presents a compelling case for adoption. However, it is important to acknowledge the potential risks that accompany it. One prominent concern lies in the centralized nature of the controller, which, if compromised, could act as a single point of failure. This vulnerability can be mitigated by implementing controller redundancy throughout the network, complete with automatic fail-over capabilities. While this may incur additional expense, it aligns with the principles of maintaining business continuity, akin to the judicious addition of redundancy to other critical network components.

Distinguishing Software Defined Networking in IoT from Traditional Networking

The difference between software-defined networking (SDN) and traditional networking lies primarily in their underlying infrastructure. While traditional networking relies on hardware components, SDN operates on a software basis. This fundamental difference gives SDN a flexibility that surpasses the confines of traditional networking. Through a software-driven control panel, SDN enables administrators to oversee the network, modify configuration settings, allocate resources, and augment network capacity from a centralized user interface, all without deploying additional hardware. Moreover, SDN and traditional networking diverge in terms of security. SDN, being software defined, offers enhanced security attributes owing to its heightened visibility and ability to define secure pathways. However, the centralized controller must be safeguarded, as it represents a potential vulnerability and single point of failure within SDN networks, which could compromise the network's overall security.

IIoT Analytics: Machine Learning

Machine Learning

Machine learning is a subset of artificial intelligence that enables machines to make decisions based on their experience rather than being explicitly programmed.

Machine Learning Overview

Machine learning is, in fact, part of a larger set of technologies commonly grouped under the term artificial intelligence (AI).

In fact, AI includes any technology that allows a computing system to mimic human intelligence using any technique, from very advanced logic to basic “if-then-else” decision loops. Any computer that uses rules to make decisions belongs to this realm.

A typical example is a dictation program that runs on a computer. The program is configured to recognize the audio pattern of each word in a dictionary, but it does not know your voice’s specifics—your accent, tone, speed, and so on.

You need to record a set of predetermined sentences to help the tool match well-known words to the sounds you make when you say the words. This process is called machine learning.

ML is concerned with any process where the computer needs to receive a set of data that is processed to help perform a task with more efficiency. ML is a vast field but can be simply divided into two main categories: supervised and unsupervised learning.

Types of Machine Learning Algorithms

1. Unsupervised Learning

2. Supervised Learning

3. Reinforcement Learning
Unsupervised Learning

This machine learning technique is used to identify similar groups of data, known as clustering. The segregation is performed on an unlabeled dataset, based on the inner structure of the data, without looking at a specific outcome.
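As an illustration of clustering, here is a minimal k-means sketch in plain Julia (the language covered later in this unit); the function name, random data, and number of clusters are all invented for the example, and a real system would typically rely on a clustering library instead.

using Random, Statistics

# Minimal k-means sketch: group unlabeled samples (one column per sample) into k clusters
function simple_kmeans(X::Matrix{Float64}, k::Int; iters::Int = 100)
    n = size(X, 2)
    centers = X[:, randperm(n)[1:k]]        # start from k distinct random samples
    labels = zeros(Int, n)
    for _ in 1:iters
        # Assignment step: attach every sample to its nearest center
        for j in 1:n
            dists = [sum(abs2, X[:, j] .- centers[:, c]) for c in 1:k]
            labels[j] = argmin(dists)
        end
        # Update step: move each center to the mean of its members
        for c in 1:k
            members = findall(==(c), labels)
            if !isempty(members)
                centers[:, c] = vec(mean(X[:, members], dims = 2))
            end
        end
    end
    return labels, centers
end

# Usage: two loose groups of unlabeled 2-D sensor readings
X = hcat(randn(2, 50) .+ 5.0, randn(2, 50) .- 5.0)
labels, centers = simple_kmeans(X, 2)
println(centers)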

Supervised learning

In supervised learning, the machine is trained with input for which there is a
known correct answer. For example, suppose that you are training a system to
recognize when there is a human in a mine tunnel.

A sensor equipped with a basic camera can capture shapes and return them to
a computing system that is responsible for determining whether the shape is a
human or something else (such as a vehicle, a pile of ore, a rock, a piece of
wood, and so on.)
With supervised learning techniques, hundreds or thousands of images are fed
into the machine, and each image is labeled (human or nonhuman in this case).
This is called the training set. An algorithm is used to determine common
parameters and common differences between the images.

The comparison is usually done at the scale of the entire image, or pixel by
pixel. Images are resized to have the same characteristics (resolution, color
depth, position of the central figure, and so on), and each point is analyzed.
Human images have certain types of shapes and pixels in certain locations.

Each new image is compared to the set of known “good images,”

and a deviation is calculated to determine how different the new image is from
the average human image and, therefore, the probability that what is shown is
a human figure. This process is called classification.

After training, the machine should be able to recognize human shapes. Before
real field deployments, the machine is usually tested with unlabelled pictures—
this is called the validation or the test set, depending on the ML system used—
to verify that the recognition level is at acceptable thresholds. If the machine
does not reach the level of success expected, more training is needed.
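To make the idea of comparing a new image against the average of the known “good images” concrete, here is a small illustrative sketch in Julia; the image size, threshold, and random stand-in data are invented for the example and do not reflect any real deployment.

using Statistics

# Pixel-wise average of the labeled "human" training images (each image = flattened pixel vector)
function mean_image(images::Vector{Vector{Float64}})
    return reduce(+, images) ./ length(images)
end

# Classify by how far a new image deviates from the average human image
function looks_human(candidate::Vector{Float64}, avg::Vector{Float64}, threshold::Float64)
    deviation = sqrt(mean(abs2, candidate .- avg))   # root-mean-square pixel deviation
    return deviation <= threshold
end

human_training_set = [rand(64) for _ in 1:100]       # stand-in for 100 labeled 8x8 human images
avg = mean_image(human_training_set)
println(looks_human(rand(64), avg, 0.4))             # classify a new, unlabeled image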

Reinforcement Learning Algorithm

It is a machine learning algorithm that enables machines to improve their performance by automatically learning the ideal behaviors for a specific environment.
Data Science

Julia Programming

Julia provides an unobtrusive yet powerful and dynamic type system. With the help of multiple dispatch, the user can define function behavior across many combinations of argument types. It has a powerful shell that makes Julia able to manage other processes easily. The user can call C functions without any wrappers or special APIs. Julia provides efficient support for Unicode.

It also provides its users with Lisp-like macros as well as other metaprogramming facilities. It provides lightweight green threading, i.e., coroutines.

It is well suited for parallelism and distributed computation.

Code written in Julia is fast because there is no need to vectorize code for performance.

It can efficiently interface with other programming languages such as Python, R, and Java. For example, it can interface with Python using PyCall, with R using RCall, and with Java using JavaCall.

• Open source
• Distributed computation and parallelism possible
• Efficient Unicode support
• C functions can be called directly
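The multiple dispatch mentioned above can be shown with a short sketch; the sensor types and readings below are made up purely for illustration.

# One generic function, with the method chosen by the types of the arguments
struct TemperatureSensor end
struct PressureSensor end

describe(::TemperatureSensor, value) = "temperature reading: $(value) °C"
describe(::PressureSensor, value)    = "pressure reading: $(value) kPa"
describe(sensor, value)              = "unknown sensor reading: $(value)"   # fallback method

println(describe(TemperatureSensor(), 21.5))   # dispatches on the first argument's type
println(describe(PressureSensor(), 101.3))
println(describe("camera", 42))                # falls back to the generic method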

Basics of Julia Programming

• println() is used to print
• Variables can be assigned without declaring a type
• Basic math operations
• Assigning strings
• Use of the $ sign for string interpolation
• String concatenation
• Data structures:
  1. Tuples
  2. Dictionaries
  3. Arrays
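The following short, runnable snippet illustrates each of the basics listed above; the values are arbitrary examples.

println("Hello, Julia")                          # println() prints a line

x = 10                                           # variables are assigned without declaring a type
y = 2.5
println(x + y, "  ", x^2, "  ", x / 4)           # basic math: 12.5  100  2.5

name = "IIoT"                                    # assigning a string
println("Unit 3 covers $name analytics")         # $ sign: string interpolation
println("Data " * "Management")                  # string concatenation uses *

t = (25.4, "temperature", true)                  # 1. Tuple: fixed length, immutable
d = Dict("sensor" => "S1", "value" => 71.3)      # 2. Dictionary: key => value pairs
a = [10, 20, 30]                                 # 3. Array: ordered and mutable
push!(a, 40)                                     # arrays can grow
println(t[1], "  ", d["sensor"], "  ", a)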
Data Management

• Data should be archived or disposed of in a safe and secure manner during and after the conclusion of a research project.
• Data can be handled electronically as well as through non-electronic means.
• Most industrial data –

Hadoop

• A framework for the distributed storage and processing of large datasets across large clusters of computers.
• It is an open-source implementation of the Google File System (GFS) and MapReduce.
• Its MapReduce and HDFS components were originally derived, respectively, from Google's MapReduce and GFS.

Building Blocks of Hadoop

• Hadoop Common: contains the utilities that support the other Hadoop components.
• HDFS: the Hadoop Distributed File System, which stores data across the cluster.
• MapReduce: a framework for writing applications that process large amounts of data in parallel.
• YARN: a next-generation MapReduce (MRv2) framework for resource management and job scheduling in a Hadoop cluster.

HDFS Architecture and Components

HDFS follows the master-slave architecture and it has the following elements.

Namenode

The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can run on commodity hardware. The system having the namenode acts as the master server, and it does the following tasks −

• Manages the file system namespace.
• Regulates clients’ access to files.
• Executes file system operations such as renaming, closing, and opening files and directories.

Datanode

The datanode is commodity hardware with the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.

• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block

Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB (128 MB in newer Hadoop versions), but it can be changed as needed in the HDFS configuration.

Goals of HDFS

Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.

Huge datasets − HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.

Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.

Inserting Data into HDFS

Assume we have data in a file called file.txt in the local system that needs to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.

Step 1
You have to create an input directory.

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input


Step 2
Transfer and store a data file from local systems to the Hadoop file
system using the put command.

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input


Step 3
You can verify the file using ls command.

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input


Retrieving Data from HDFS
Assume we have a file in HDFS called outfile.
Given below is a simple demonstration for retrieving the required file from
the Hadoop file system.

Step 1
Initially, view the data from HDFS using cat command.

$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile


Step 2
Get the file from HDFS to the local file system using get command.

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Shutting Down the HDFS
You can shut down the HDFS by using the following command.

$ stop-dfs.sh
There are many more commands in "$HADOOP_HOME/bin/hadoop fs"
than are demonstrated here, although these basic operations will get you
started. Running ./bin/hadoop dfs with no additional arguments will list
all the commands that can be run with the FsShell system. Furthermore,
$HADOOP_HOME/bin/hadoop fs -help commandName will display a
short usage summary for the operation in question, if you are stuck.

The following conventions are used for command parameters −

"<path>" means any file or directory name.


"<path>..." means one or more file or directory names.
"<file>" means any filename.
"<src>" and "<dest>" are path names in a directed operation.
"<localSrc>" and "<localDest>" are paths as above, but on the local file
system.
All other files and path names refer to the objects inside HDFS.
MapReduce is a framework with which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
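To give a feel for the programming model, here is a conceptual word-count sketch of the map, shuffle, and reduce phases written in Julia rather than Hadoop's Java MapReduce API; the input lines are invented and everything runs in a single process, so it only illustrates the idea.

# Input split into lines (stand-in for records read from HDFS)
lines = ["iot data analytics", "big data analytics", "iot sensors"]

# Map phase: each line is turned into (word, 1) pairs
mapped = [(String(word), 1) for line in lines for word in split(line)]

# Shuffle/sort phase: group the intermediate pairs by key (the word)
grouped = Dict{String, Vector{Int}}()
for (word, count) in mapped
    push!(get!(grouped, word, Int[]), count)
end

# Reduce phase: sum the counts for each word
reduced = Dict(word => sum(counts) for (word, counts) in grouped)
println(reduced)   # e.g. "data" => 2, "iot" => 2, "analytics" => 2, ...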

Hadoop Data Management

Hadoop is a powerful framework for data management, storage, and processing, particularly suited for handling large-scale, distributed datasets. Data management in Hadoop involves various aspects, including data ingestion, storage, organization, processing, and retrieval. Here are key concepts and components related to Hadoop data management:

Hadoop Distributed File System (HDFS):


HDFS is the primary storage system in Hadoop, designed to store vast
amounts of data across a cluster of commodity hardware. It divides large
files into blocks and replicates them across multiple nodes for fault
tolerance.
Data Ingestion:

Data can be ingested into Hadoop using various methods, including batch ingestion (e.g., using tools like Sqoop or Flume), real-time streaming (e.g., Kafka), and manual uploads.
Data Storage:

Hadoop stores data in a distributed, fault-tolerant manner across the HDFS cluster. Data is divided into blocks, typically 128 MB or 256 MB in size, and these blocks are replicated to ensure data durability.
Data Formats:
Hadoop supports various data formats, including text, Avro, Parquet,
ORC, and others. Choosing the right format can impact storage efficiency
and query performance.
Metadata Management:
Metadata about the stored data, such as file locations, block replication
levels, and file structure, is maintained by the NameNode in HDFS. It
helps track and manage data across the cluster.
Data Organization:

Data can be organized into directories and subdirectories within HDFS. Proper organization facilitates data discovery and management.
Data Processing:

Hadoop offers the MapReduce framework, which allows for distributed data processing. Additionally, tools like Apache Spark, Hive, Pig, and Flink provide higher-level abstractions for data processing and analytics.
Data Retrieval:

Users and applications can retrieve data from Hadoop using various
query and analysis tools. SQL-like languages (e.g., Hive’s HQL), scripting
languages (e.g., Pig Latin), and programming languages (e.g., Java,
Python) can be used for data retrieval.
Data Security:

Hadoop provides security features like authentication, authorization, and encryption to protect data both in transit and at rest.
Data Lifecycle Management:
Managing the lifecycle of data includes data retention policies, archiving,
data purging, and data backup strategies.
Data Quality and Governance:

Ensuring data quality, integrity, and compliance with regulatory requirements is essential. Data governance practices and tools help maintain data quality and compliance.
Data Catalogs and Metadata Repositories:

Metadata about data assets, such as data lineage, data definitions, and
data ownership, can be stored in data catalogs and metadata repositories
to aid in data discovery and usage.
Data Compression and Optimization:

Data compression techniques are often employed to reduce storage requirements and improve data processing performance. Tools like Apache ORC and Apache Parquet use columnar storage and compression to optimize data storage and querying.
Data Backup and Disaster Recovery:
Implementing backup and disaster recovery strategies is critical to ensure
data availability and business continuity.
Data Retention Policies:
Defining and enforcing data retention policies helps manage data growth
and ensures that only relevant and necessary data is retained.
Data Privacy and Compliance:
Compliance with data privacy regulations, such as GDPR or HIPAA, is
crucial when managing sensitive or personal data within Hadoop clusters.
