Bigdata Unit II
P.SRIDEVI
DEPT. OF IT/CSE
UNIT II Stream Processing
• Mining data streams:
• Introduction to Streams Concepts,
• Stream Data Model and Architecture,
• Stream Computing, Netflix Kinesis Data Streams, Nasdaq Architecture
using Amazon S3
• Sampling Data in a Stream,
• Filtering Streams,
• Counting Distinct Elements in a Stream,
• Estimating Moments,
• Counting Ones in a Window,
• Decaying Window,
• Real time Analytics Platform (RTAP) Applications,
Big data streaming is a process in which large streams of real-time data
are processed with the aim of extracting insights and useful trends
from them.
Key terms: an element is a single data item in the stream; elements are
typically structured as tuples; groups of related elements form
transactions; the raw stream may also be kept in an archival store for
offline processing.
Actions on streaming data
Stream processing starts by ingesting data from a publish-subscribe
service, performing an action on it, and then publishing the results back
to the publish-subscribe service or to another data store. These actions
can include analyzing, filtering, transforming, combining, or cleaning
the data.
Actions that stream processing takes on data include (a sketch of the
first follows this list):
1. aggregations (e.g., calculations such as sum, mean, and standard
deviation),
2. analytics (e.g., predicting a future event based on patterns in the data),
3. transformations (e.g., changing a number into a date format),
4. enrichment (e.g., combining the data point with other data sources to
add context).
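As a concrete illustration of the first category, here is a minimal Python
sketch that maintains count, sum, mean, and standard deviation over a stream
in a single pass using Welford's online algorithm. The RunningStats class and
the sample readings are invented for illustration and are not part of any
particular streaming tool.

import math

class RunningStats:
    """Streaming aggregation: count, sum, mean, and (population) standard
    deviation maintained in O(1) memory with Welford's online algorithm."""
    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        self.total += x
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self) -> float:
        return math.sqrt(self._m2 / self.n) if self.n else 0.0

stats = RunningStats()
for reading in [4.0, 7.0, 13.0, 16.0]:  # stand-in for an unbounded stream
    stats.update(reading)
print(stats.n, stats.total, stats.mean, stats.stddev)  # 4 40.0 10.0 4.74...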
The best data streaming and analytics tools
• Amazon Kinesis
• Google Cloud DataFlow
• Azure Stream Analytics
• IBM Streaming Analytics
• Apache Storm
• StreamSQL
• Amazon Kinesis
Kinesis's flexibility helps businesses start with basic reports and
insights into their data; as demands grow, it can be used to deploy machine
learning algorithms for in-depth analysis.
• Google Cloud DataFlow
By implementing streaming analytics, firms can filter out data that is
ineffectual and would slow down the analytics.
Utilising Apache Beam with Python, we can define data pipelines to extract,
transform, and analyse data from various IoT devices and other data sources.
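A minimal sketch of such a pipeline with the Apache Beam Python SDK is shown
below. The sensor records and the 60-degree threshold are made up for
illustration; a production Dataflow job would typically read from a real
source such as Pub/Sub rather than beam.Create.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        # Extract: a tiny in-memory stand-in for readings from IoT devices
        | "Extract" >> beam.Create([
            {"device": "sensor-1", "temp": 21.5},
            {"device": "sensor-2", "temp": 98.0},  # an ineffectual outlier
            {"device": "sensor-1", "temp": 22.1},
        ])
        # Filter out data that would slacken the analytics
        | "Filter" >> beam.Filter(lambda r: r["temp"] < 60.0)
        # Transform each record into a (device, temperature) pair
        | "Transform" >> beam.Map(lambda r: (r["device"], r["temp"]))
        # Analyse: average temperature per device
        | "Analyse" >> beam.combiners.Mean.PerKey()
        | "Print" >> beam.Map(print)
    )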
• Apache Storm
Open-sourced by Twitter, Apache Storm is a must-have tool for
real-time data evaluation. Unlike Hadoop, which carries out batch processing,
Apache Storm is specifically built for processing streams of data. However,
it can also be used for online machine learning and ETL.
Strengths & Weaknesses of Data Streams
• Advantages of Data Streams:
• Real-time insights
• Increased customer satisfaction
• Helps in minimizing costs
• Provides the detail needed to react swiftly to risk
• Disadvantages of Data Streams:
• Network failures
• Recent data may not be available
• Lack of security for data in the cloud
• Big Data challenges are:
• Sharing and accessing data
• Privacy and security: another major challenge with Big Data, with
sensitive, conceptual, technical, and legal dimensions.
• Analytical challenges
• Technical challenges: quality and storage of huge data
• Scalability: Big Data projects can grow and evolve rapidly; the
scalability issue of Big Data has led towards cloud computing.
When the volume of the underlying data is very large, high-speed, and
continuously flowing, a number of computational and mining challenges
arise, listed below.
(1) Data contained in data streams is fast-changing, high-speed, and real-
time.
(2) Multiple or random access to data stream elements is expensive, indeed
almost impossible.
(3) A huge volume of data must be processed in limited memory. This is a
challenge (a sketch of one standard response follows this list).
(4) A data stream mining system must process high-speed, gigantic
data within time limitations.
(5) The arriving data is multidimensional and low-level, so techniques to
mine it need to be very sophisticated.
(6) Data stream elements change rapidly over time. Thus, data from the
past may become irrelevant for the mining.
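Challenge (3) is usually met by keeping a small synopsis of the stream
instead of the stream itself; the "Sampling Data in a Stream" topic in this
unit is one example. Below is a minimal Python sketch of reservoir sampling,
which keeps a uniform random sample of k elements in O(k) memory; the
function name and the example stream are illustrative.

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k elements from a stream of
    unknown length, using O(k) memory (reservoir sampling)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randint(0, i)     # position among the i+1 items seen
            if j < k:
                reservoir[j] = item      # keep item with probability k/(i+1)
    return reservoir

print(reservoir_sample(range(1_000_000), 5))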
Basics of Stream Processing
• A stream is a graph of nodes connected by edges.
• Each node in the graph is an operator or adaptor that will process the
data in the stream.
• Nodes can have zero or more inputs and zero or more outputs.
• The output of one node is connected to the input of another node or
nodes.
• The edges of the graph that link the nodes together represent the
stream of data moving between the operators.
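The toy Python sketch below (not a real streaming engine) mimics this
node-and-edge model: each operator consumes the stream produced by the node
upstream of it and yields a stream for the node downstream, so chaining the
calls forms the data flow graph. All operator names here are invented for
illustration.

def source():                         # node with 0 inputs, 1 output
    for line in ["5", "oops", "12", "7"]:
        yield line

def parse(upstream):                  # operator: 1 input, 1 output
    for item in upstream:
        try:
            yield int(item)
        except ValueError:
            pass                      # drop malformed tuples

def threshold(upstream, limit):       # operator with a parameter
    for value in upstream:
        if value > limit:
            yield value

def sink(upstream):                   # node with 1 input, 0 outputs
    for value in upstream:
        print("alert:", value)

# Edges of the graph: source -> parse -> threshold -> sink
sink(threshold(parse(source()), limit=6))   # prints alerts for 12 and 7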
Data stream process
• Data stream processing is represented as a data flow graph consisting of data
sources, data processors, and data sinks, and it is used to deliver processed
results to external applications for analysis.
• The stream type of an input port must match the stream type of the
output port it connects to.
• Operators include source operators, functors, and sink operators (for
example, FileSink and ODBCAppend).
• A stream is a series of connected operators.
• The initial set of operators are referred to as source operators.
• These (source) operators read the input stream and send the data
downstream.
• The intermediate steps between input and output comprise various
operators that perform specific actions on the data as it flows downstream.
• An input stream can fan out into multiple output streams.
• The final operators, which deliver the outputs, are called sink operators.
• Each of the operators can run on a separate server in a cluster.
• The stream must ensure that data flows from one operator to the other
whether or not the operators are running on the same server.
Stream processing
• Streams are sequences of data that flow between operators in
a stream processing application.
• A stream can be an infinite sequence of tuples, such as a stream for a
traffic flow sensor.
• Operators process an input stream in one of the following logical
modes: 1. tuple-at-a-time, as each tuple arrives, or 2. as a window.
• A window is a sequence of tuples that is a subset of a possibly infinite
stream.
• A stream processing application is a collection of operators, all of
which process streams of data. One of the most basic stream
processing applications is a collection of three operators: a source
operator, a process operator, and a sink (output) operator.
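To make the window mode concrete, the Python sketch below keeps a fixed-size
sliding window in memory and emits an aggregate each time a new tuple
arrives (tuple-at-a-time processing would simply act on each value alone).
The function name and sensor readings are invented for illustration.

from collections import deque

def sliding_average(stream, size):
    """Process a stream 'as a window': emit the mean of the last
    `size` tuples every time a new tuple arrives."""
    window = deque(maxlen=size)       # old tuples fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

readings = [10, 12, 11, 50, 13]       # illustrative sensor stream
for avg in sliding_average(readings, size=3):
    print(round(avg, 2))              # 10, 11.0, 11.0, 24.33, 24.67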
Applications of stream data
1. Financial services: credit card operations in banks
2. Stock market data
3. Fraud detection
4. Weather patterns during hailstorms and tsunamis
5. Healthcare equipment such as heartbeat and BP monitors, to save lives
6. Web server log records
7. Telecommunications: Call Data Records (CDRs) containing detailed usage
information
8. Defense, surveillance, cyber security
9. Transactions in retail chains and on online e-commerce sites
A data stream is characterized by the following:
• A data stream is potentially unbounded in size.
• The data is being generated continuously in real time.
• The data is generated sequentially as a stream. It is most often
ordered by the timestamp assigned to each of the arriving records
implicitly (by arrival time) or explicitly (by generation time).
• Typically, the volume of the data is very large, and the rates are high
and irregular.
For example, storing one day’s worth of IP network traffic flows may
require as much as 2.5TB
Data Stream Management Systems (DSMSs) queries
• Such queries might include:
1. detection and analysis of extreme events and fraud (e.g., a sudden
increase in trading volume in a stock market application, unusual credit
card transactions),
2. intrusion detection (e.g., DDoS attacks),
3. worms and viruses in IP network monitoring,
4. traffic outliers (e.g., generation of statistics for network provisioning
and service level agreements),
5. network application identification using application-specific
signatures (e.g., peer-to-peer file sharing applications without a server), etc.
Processing such tasks often involves evaluating a number of complex
aggregations, joins, matches against various regular expressions, and the
generation of data synopses; a small example follows.
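As an illustration of the first kind of query, the Python sketch below keeps
an exponentially weighted moving average of the stream as a tiny synopsis and
flags any value that suddenly exceeds it by a large factor, roughly the shape
of a "sudden increase in trading volume" detector. The alpha and factor
parameters and the sample volumes are made-up defaults, not taken from any
real DSMS.

def spike_alerts(stream, alpha=0.1, factor=3.0):
    """Flag 'extreme events' by comparing each value against an
    exponentially weighted moving average (EWMA) of the stream."""
    ewma = None
    for value in stream:
        if ewma is not None and value > factor * ewma:
            yield ("ALERT", value, round(ewma, 1))
        # update the synopsis after checking the new value
        ewma = value if ewma is None else alpha * value + (1 - alpha) * ewma

volumes = [100, 110, 95, 105, 900, 120]   # made-up per-minute trade volumes
print(list(spike_alerts(volumes)))        # [('ALERT', 900, 100.9)]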
Event Stream Processing (Real-Time Processing)
• Event stream processing (ESP) is the practice of taking action on a
series of data points that originate from a system that continuously
creates data.
• The term “event” refers to each data point in the system, and
“stream” refers to the ongoing delivery of those events.
• A series of events can also be referred to as “streaming data” or “data
streams.”
• Actions that are taken on those events include aggregations (e.g.,
calculations such as sum, mean, standard deviation),
• analytics (e.g., predicting a future event based on patterns in the
data), and ingestion (e.g., inserting the data into a database).
• Event stream processing is necessary for situations where action needs
to be taken as soon as possible. This is why event stream processing
environments are often described as “real-time processing.”
Stream processing
5 LOGICAL LAYERS OF STREAM ARCHITECTURE
The Components of a Streaming Architecture