UNIT IV
DATA ANALYTICS AND SUPPORTING SERVICES
Structured data
• Structured data is data that has been
predefined and formatted to a set structure
before being placed in data storage
• Example: Relational database
• Advantages
• Easily used by machine learning algorithms
• Easily used by business users
• Increased access to more tools
• Disadvantages
• Lack of flexibility
• Predefined purpose limits use
• Limited storage options
Unstructured data
• Unstructured data is data stored in its native
format and not processed until it is used,
which is known as schema-on-read.
• Advantages
• Freedom of the native format
• Faster accumulation rates
• Data lake storage
• Disadvantages
• Requires data science expertise
• Requires specialized tools
Structured data vs. unstructured data
• Structured Data
– Self-service access
– Only select data types
– Schema-on-write
– Commonly stored in data warehouses
– Predefined format
• Unstructured Data
– Requires data science expertise
– Many varied data types stored together
– Schema-on-read
– Commonly stored in data lakes
– Native format
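The difference between schema-on-write and schema-on-read can be sketched in a few lines. This is a minimal illustration, not a production pattern: the table name `readings`, the field names, and the sample records are all hypothetical. It uses SQLite as a stand-in for a relational store and plain JSON strings as a stand-in for data kept in its native format.

```python
import json
import sqlite3

# Schema-on-write: the table structure is fixed before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT NOT NULL, temp_c REAL NOT NULL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", ("s1", 21.5))
conn.execute("INSERT INTO readings VALUES (?, ?)", ("s2", 23.0))
avg_temp = conn.execute("SELECT AVG(temp_c) FROM readings").fetchone()[0]

# Schema-on-read: raw records are kept as they arrived and are
# interpreted only when a question is asked of them.
raw_records = [
    '{"sensor_id": "s1", "temp_c": 21.5}',
    '{"sensor_id": "s3", "note": "door opened"}',  # different shape, still stored
]

def temps_on_read(lines):
    """Apply a schema at read time: keep only records that carry temp_c."""
    parsed = (json.loads(line) for line in lines)
    return [rec["temp_c"] for rec in parsed if "temp_c" in rec]
```

The structured side can be queried directly with SQL, while the unstructured side accepts any record but needs parsing logic each time it is read.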
Data in motion vs. data at rest
• Data in motion
• The collection process for data in motion is similar to
that of data at rest; however, the difference lies in the
analytics. In this case, the analytics occur in real time
as the event happens.
• Data at rest
• This refers to data that has been collected from various
sources and is then analyzed after the event occurs.
The point where the data is analyzed and the point
where action is taken on it occur at two separate times.
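As a rough sketch of the two analytics styles, the example below processes the same events in two ways: one function reacts to each event as it arrives, the other analyzes a stored batch afterwards. The threshold value, field names, and sample events are assumptions made for illustration.

```python
from statistics import mean

THRESHOLD_C = 80.0  # hypothetical alert threshold

def alerts_in_motion(stream):
    """Data in motion: evaluate each event as it arrives and react at once."""
    for event in stream:
        if event["temp_c"] > THRESHOLD_C:
            yield event["sensor_id"]

def average_at_rest(stored_events):
    """Data at rest: analyze a stored batch after the events have occurred."""
    return mean(e["temp_c"] for e in stored_events)

events = [
    {"sensor_id": "s1", "temp_c": 70.0},
    {"sensor_id": "s2", "temp_c": 90.0},
]
```

Here `alerts_in_motion` is a generator, so an alert is produced as soon as the offending event is seen, without waiting for the rest of the stream.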
Data in motion
• Data in transit, or data in motion, is data
actively moving from one location to another
such as across the internet or through a
private network.
• Data protection in transit is the protection of
this data while it’s travelling from network to
network or being transferred from a local
storage device to a cloud storage device
Data at rest
• Data at rest is data that is not actively moving
from device to device or network to network
such as data stored on a hard drive, laptop,
flash drive, or archived/stored in some other
way.
• Data protection at rest aims to secure inactive
data stored on any device or network.
Role of machine learning in IoT
• IoT and machine learning deliver insights
otherwise hidden in data for rapid, automated
responses and improved decision making.
• Machine learning for IoT can be used to
project future trends, detect anomalies, and
augment intelligence by ingesting image,
video and audio.
Need for ML
• Machine learning: demystifies the hidden patterns in IoT
data by analyzing massive volumes of data
• Machine learning inference: supplements or replaces
manual processes with automated systems using
statistically derived actions in critical processes.
• With machine learning for IoT, you can:
• Ingest and transform data into a consistent format
• Build a machine learning model
• Deploy this machine learning model on cloud, edge and
device
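The ingest, build, and deploy steps can be sketched with a deliberately simple statistical model. This is an assumption-laden stand-in for a real machine learning library: the "model" is just a mean and standard deviation, and the function names, sample history, and the default of three standard deviations are all hypothetical choices.

```python
from statistics import mean, stdev

def transform(raw):
    """Ingest mixed inputs and convert them to a consistent float format."""
    return [float(x) for x in raw]

def build_model(values):
    """'Train' a minimal model: the mean and standard deviation of the data."""
    return {"mean": mean(values), "stdev": stdev(values)}

def is_anomaly(model, value, k=3.0):
    """Inference: flag values more than k standard deviations from the mean."""
    return abs(value - model["mean"]) > k * model["stdev"]

# Historical readings arrive as a mix of strings and numbers.
history = transform(["20.0", 21, "19.5", 20.5, "20.0"])
model = build_model(history)
```

Because the fitted model is only a small dictionary, the same `is_anomaly` check could run in the cloud, on an edge gateway, or on a device.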
Benefits
• Simplify machine learning model training
• Flexibility to use your data science library of choice
• Rapid model deployment to operationalize
machine learning quickly
• Prebuilt connectors for operational & historical
datastores
• Integration with Cumulocity IoT Streaming
Analytics
• Notebook integration
NoSQL Databases
• NoSQL is a class of databases that support semi-structured
and unstructured data, in addition to the structured data
handled by data warehouses and massively parallel
processing (MPP) databases.
• Includes different types of databases
• Document stores
• Key-value stores
• Wide column stores
• Graph stores
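To make the four NoSQL types more concrete, the sketch below models the shape of the data each one holds using plain Python structures. No real database is involved, and every key, field, and relationship name is hypothetical.

```python
# Document store: each record is a self-describing, nested document.
document = {"_id": "reading-1", "sensor": {"id": "s1", "site": "plant-a"}, "temp_c": 21.5}

# Key-value store: an opaque value looked up by a single key.
key_value = {"sensor:s1:latest": "21.5"}

# Wide column store: rows keyed by ID, each with its own set of columns.
wide_column = {
    "s1": {"temp_c": 21.5, "humidity": 40},
    "s2": {"temp_c": 23.0},  # rows need not share the same columns
}

# Graph store: nodes plus edges that describe relationships.
graph = {
    "nodes": {"s1", "gateway-1", "plant-a"},
    "edges": [("s1", "CONNECTS_TO", "gateway-1"), ("gateway-1", "LOCATED_IN", "plant-a")],
}

def neighbors(g, node):
    """Follow the outgoing edges from a node."""
    return [dst for src, _, dst in g["edges"] if src == node]
```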
Hadoop
• Hadoop is a popular choice as a data repository and
processing engine.
• Two key elements are still present in current
Hadoop distributions and provide the
foundation for other projects
• Hadoop Distributed File System (HDFS): A system
for storing data across multiple nodes
• MapReduce: A distributed processing engine that
splits a large task into smaller ones that can be run
in parallel
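The MapReduce idea of splitting a large task into smaller ones can be illustrated on a single machine. The sketch below counts readings per sensor; it runs sequentially and only mimics the map, shuffle, and reduce phases, so it is not how Hadoop itself is invoked. The chunk contents are made-up sample data.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (sensor_id, 1) pair for every reading in one chunk."""
    return [(sensor_id, 1) for sensor_id in chunk]

def shuffle(pairs):
    """Group emitted values by key so each key can be reduced independently."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# Each chunk could be handled by a different node in a real cluster.
chunks = [["s1", "s2", "s1"], ["s2", "s3"]]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

Because each chunk is mapped independently and each key is reduced independently, both phases are candidates for running in parallel.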
[Figure: Distributed Hadoop cluster]
Hadoop
• Both MapReduce and HDFS take advantage of
this distributed architecture to store and
process massive amounts of data and are thus
able to leverage resources from all nodes in
the cluster.
• For HDFS, this capability is handled by
specialized nodes in the cluster, including
NameNodes and DataNodes
Hadoop
• NameNodes: These are a critical piece in data
adds, moves, deletes, and reads on HDFS. They
coordinate where the data is stored, and
maintain a map of where each block of data is
stored and where it is replicated.
• DataNodes: These are the servers where the
data is stored at the direction of the NameNode.
It is common to have many DataNodes in a
Hadoop cluster to store the data.
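The division of labor between NameNodes and DataNodes can be sketched with a toy model. This is a simplification under several assumptions: the class and method names are invented, placement is round-robin rather than the rack-aware policy a real cluster uses, the block size is a few characters, and the client's role in sending blocks is folded into the NameNode class for brevity.

```python
class DataNode:
    """Stores block contents; knows nothing about whole files."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

class NameNode:
    """Keeps the map of which DataNodes hold each block of each file."""
    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # (filename, block index) -> list of DataNode names

    def write(self, filename, data, block_size=4):
        blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        for idx, block in enumerate(blocks):
            # Round-robin placement, a simplification of real placement rules.
            targets = [self.datanodes[(idx + r) % len(self.datanodes)]
                       for r in range(self.replication)]
            for node in targets:
                node.blocks[(filename, idx)] = block
            self.block_map[(filename, idx)] = [n.name for n in targets]

    def read(self, filename):
        by_name = {n.name: n for n in self.datanodes}
        keys = sorted(k for k in self.block_map if k[0] == filename)
        # Read each block from the first replica listed in the map.
        return "".join(by_name[self.block_map[k][0]].blocks[k] for k in keys)

nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
nn = NameNode(nodes)
nn.write("log.txt", "abcdefghij")
```

After the write, `nn.block_map` records where every block and its replica live, which is the bookkeeping role the slide attributes to the NameNode.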
[Figure: Writing a file to HDFS]
Hadoop Ecosystem
• Many organizations have adopted Hadoop
clusters for storage and processing of data and
have looked for complementary software
packages to add additional functionality to their
distributed Hadoop clusters.
• Since the release of Hadoop 1.0 in 2011, many
projects have been developed to add incremental
functionality to Hadoop and have collectively
become known as the Hadoop ecosystem.
Apache Kafka
• Apache Kafka is a distributed publisher-
subscriber messaging system that is built to be
scalable and fast.
• Data is organized into topics hosted on message
brokers: producers write data to topics and
consumers read data from these topics.
• The goal of Kafka is to provide a simple way to
connect to data sources and allow consumers to
connect to that data in the way they would like.
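The publish-subscribe flow can be sketched with a small in-memory stand-in. This is not the Kafka client API: `ToyBroker`, its methods, the topic name, and the messages are all hypothetical, and the sketch ignores partitions, persistence, and networking. It only shows producers appending to a topic and consumers reading from their own positions.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a broker: an append-only log per topic."""
    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        """Producer side: append a message to the end of the topic."""
        self.topics[topic].append(message)

    def consume(self, topic, offset):
        """Consumer side: return messages from offset on, plus the next offset."""
        messages = self.topics[topic][offset:]
        return messages, offset + len(messages)

broker = ToyBroker()
broker.produce("temperature", {"sensor_id": "s1", "temp_c": 21.5})
broker.produce("temperature", {"sensor_id": "s2", "temp_c": 23.0})

# Two independent consumers each track their own position in the topic.
dashboard_msgs, dashboard_offset = broker.consume("temperature", 0)
archiver_msgs, archiver_offset = broker.consume("temperature", 1)
```

Since reading does not remove messages, each consumer can process the same topic at its own pace, which reflects the goal of letting consumers connect to the data in the way they would like.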
[Figure: Apache Kafka data flow]
Apache Spark
• Apache Spark is an in-memory distributed
data analytics platform designed to accelerate
processes in the Hadoop ecosystem.
• The “in-memory” characteristic of Spark is
what enables it to run jobs very quickly.
• By contrast, at each stage of a MapReduce
operation, the data is read from and written
back to disk, so every disk operation adds
latency.
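The cost of writing intermediate results to disk can be illustrated with a two-stage toy pipeline. This is only an analogy and uses neither Spark nor Hadoop: the stage functions, the file name, and the JSON format are assumptions chosen to keep the example self-contained.

```python
import json
import os
import tempfile

def square(values):
    """Stage 1: square every value."""
    return [v * v for v in values]

def total(values):
    """Stage 2: sum the values."""
    return sum(values)

def run_with_disk(values):
    """MapReduce-style: stage 1 output is written to disk and read back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.json")
        with open(path, "w") as f:
            json.dump(square(values), f)   # stage 1 writes its output
        with open(path) as f:
            intermediate = json.load(f)    # stage 2 reads it back
    return total(intermediate)

def run_in_memory(values):
    """In-memory style: the intermediate result never leaves memory."""
    return total(square(values))
```

Both functions return the same answer; the in-memory version simply skips the disk round trip between stages, which is the source of the speedup described for Spark.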
Edge Streaming Analytics
• Key values of edge streaming analytics include
the following
• Reducing data at the edge
• Analysis and response at the edge
• Time sensitivity
Edge Streaming Analytics – Core Functions
• Streaming analytics at the edge can be broken
down into three simple stages:
• Raw input data
• Analytics Processing Unit
• Output Streams
[Figure: Edge analytics processing unit]
Edge Analytics
• In order to perform analysis in real-time, the
APU needs to perform the following functions:
• Filter
• Transform
• Time
• Correlate
• Match patterns
• Improve business intelligence
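Several of the APU functions listed above can be sketched as small composable steps. The example covers filter, transform, time, and match patterns only; correlation across multiple streams is left out for brevity. The field names, window bounds, and temperature limit are hypothetical.

```python
def filter_valid(events):
    """Filter: drop readings that are missing a value."""
    return [e for e in events if e.get("temp_f") is not None]

def transform(events):
    """Transform: add a Celsius value converted from Fahrenheit."""
    return [{**e, "temp_c": (e["temp_f"] - 32) * 5 / 9} for e in events]

def window(events, start, end):
    """Time: keep events whose timestamp falls in [start, end)."""
    return [e for e in events if start <= e["ts"] < end]

def match_pattern(events, limit_c):
    """Match patterns: True if two consecutive readings exceed the limit."""
    temps = [e["temp_c"] for e in events]
    return any(a > limit_c and b > limit_c for a, b in zip(temps, temps[1:]))

events = [
    {"ts": 0, "temp_f": 212.0},
    {"ts": 1, "temp_f": None},    # removed by the filter
    {"ts": 2, "temp_f": 213.8},
    {"ts": 9, "temp_f": 50.0},    # outside the time window
]
recent = window(transform(filter_valid(events)), start=0, end=5)
```

Filtering and windowing before pattern matching also reduces the amount of data that has to leave the edge.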
Network Analytics
• Network-based analytics is extremely important in
managing IoT systems
• Network analytics is concerned with
discovering patterns in the communication
flows from a network traffic perspective
• Network analytics has the power to analyze
details of communication patterns made by
protocols and correlate them across the network.
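A minimal sketch of this kind of analysis is to aggregate flow records and look for conversations that do not fit the expected pattern. The record layout, host names, protocols, and byte counts below are invented for illustration and do not follow any particular flow export format.

```python
from collections import Counter

# Hypothetical flow records: (source, destination, protocol, bytes)
flows = [
    ("sensor-1", "gateway", "MQTT", 200),
    ("sensor-2", "gateway", "MQTT", 150),
    ("sensor-1", "gateway", "MQTT", 300),
    ("sensor-2", "10.0.0.99", "TELNET", 50),
]

def bytes_per_conversation(records):
    """Total traffic volume for each (source, destination, protocol) triple."""
    totals = Counter()
    for src, dst, proto, nbytes in records:
        totals[(src, dst, proto)] += nbytes
    return totals

def unexpected_protocols(records, allowed):
    """List (source, protocol) pairs that use a protocol outside the allowed set."""
    return sorted({(src, proto) for src, _, proto, _ in records if proto not in allowed})

totals = bytes_per_conversation(flows)
```

In this sample, a sensor that normally speaks only MQTT to its gateway shows up using a second protocol toward a different address, which is the kind of pattern network analytics is meant to surface.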