Big Data Unit-II
P. SRIDEVI
DEPT. OF IT
UNIT II Stream Processing
• Mining data streams:
• Introduction to Streams Concepts,
• Stream Data Model and Architecture,
• Stream Computing, Netflix's use of Kinesis Data Streams, Nasdaq
architecture using Amazon S3,
• Sampling Data in a Stream,
• Filtering Streams,
• Counting Distinct Elements in a Stream,
• Estimating Moments,
• Counting Ones in a Window,
• Decaying Window,
• Real time Analytics Platform (RTAP) Applications,
Big data streaming is a process in which large streams of real-time data
are processed with the aim of extracting insights and useful trends
from them.
[Diagram: the stream data model - a stream consists of elements (tuples
or transactions), and older stream data is moved to archival storage]
The best data streaming and analytics tools
• Amazon Kinesis
• Google Cloud DataFlow
• Azure Stream Analytics
• IBM Streaming Analytics
• Apache Storm
• StreamSQL
• Amazon Kinesis
Kinesis's flexibility lets businesses start with basic reports and insights
into their data; as demands grow, it can also be used to deploy machine
learning algorithms for in-depth analysis.
• Google Cloud DataFlow
By implementing streaming analytics, firms can filter out data that is
irrelevant and would slow down the analytics.
Utilising Apache Beam with Python, we can define data pipelines to extract,
transform, and analyse data from various IoT devices and other data sources.
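As a concrete illustration, the sketch below shows a minimal Beam pipeline in Python (the sensor records and the 30-degree threshold are hypothetical examples; it assumes the apache-beam package and runs on the local DirectRunner):

    # Minimal Apache Beam sketch: extract, transform, and analyse IoT-style
    # readings. The records and threshold are made-up examples.
    import json
    import apache_beam as beam

    readings = [
        '{"sensor": "s1", "temp": 21.5}',
        '{"sensor": "s2", "temp": 34.2}',
        '{"sensor": "s1", "temp": 36.8}',
    ]

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Extract" >> beam.Create(readings)                 # ingest raw JSON strings
            | "Transform" >> beam.Map(json.loads)                # parse each record
            | "Filter" >> beam.Filter(lambda r: r["temp"] > 30)  # drop irrelevant data
            | "Analyse" >> beam.Map(lambda r: (r["sensor"], r["temp"]))
            | "Output" >> beam.Map(print)
        )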
• Apache Storm
Open-sourced by Twitter, Apache Storm is a must-have tool for
real-time data evaluation.
Unlike Hadoop, which carries out batch processing, Apache Storm is
built specifically for transforming streams of data. It can also be used
for online machine learning and continuous computation.
Challenges of streaming data
• There are two challenges in developing new techniques that can
handle streaming data.
• The first challenge is to design fast mining methods for handling
streaming data.
• The second challenge is detecting changes in the data distribution and
evolving concepts in a highly dynamic environment.
When the underlying data is very large in volume, high-speed, and
continuously flowing, it leads to a number of computational and mining
challenges, listed below.
(1) Data contained in data streams is fast-changing, high-speed, and
real-time.
(2) Multiple or random access to data streams is expensive, and in
practice almost impossible.
(3) A huge volume of data must be processed in limited memory. This is
a challenge.
(4) A data stream mining system must process high-speed, gigantic
data within time limitations.
(5) The data arrives multidimensional and low-level, so techniques to
mine such data need to be very sophisticated.
(6) Data stream elements change rapidly over time; thus, data from the
past may become irrelevant for mining.
Basics of stream Processing
• A stream is a graph of nodes connected by edges.
• Each node in the graph is an operator or adaptor that will process the
data in the stream.
• Nodes can have zero or more inputs and zero or more outputs
• The output of one node is connected to the input of another node or
nodes
• The edges of the graph that links the nodes together represent the
stream of data moving between the operators.
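A minimal sketch of such a graph in Python (all names are illustrative, not from any particular streaming product):

    # Nodes are operators/adaptors; edges push tuples downstream.
    class Node:
        def __init__(self, name, fn):
            self.name = name     # operator or adaptor name
            self.fn = fn         # processing applied to each tuple
            self.outputs = []    # downstream nodes (outgoing edges)

        def connect(self, downstream):
            self.outputs.append(downstream)   # add an edge

        def emit(self, item):
            result = self.fn(item)
            if result is not None:            # None drops the tuple
                for node in self.outputs:
                    node.emit(result)         # forward along every edge

    # Build a three-node graph: source -> double -> sink
    sink = Node("sink", lambda x: print("out:", x))
    double = Node("double", lambda x: x * 2)
    double.connect(sink)
    source = Node("source", lambda x: x)
    source.connect(double)

    for value in [1, 2, 3]:   # tuples entering the stream
        source.emit(value)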
[Diagram: a data stream process, with results flowing to sink adaptors
such as FileSink and ODBCAppend]
Stream processing
• Streams are sequences of data that flow between operators in
a stream processing application.
• A stream can be an infinite sequence of tuples, such as a stream for a
traffic flow sensor.
• Operators process an input stream in one of the following logical
modes:
1. as each tuple arrives, or
2. as a window of tuples.
• A window is a sequence of tuples that is a subset of a possibly infinite
stream.
• A stream processing application is a collection of operators, all of
which process streams of data. One of the most basic stream
processing applications is a collection of three operators: a source
operator, a processing operator, and a sink (output) operator (see the
sketch after this list).
• A stream is a series of connected operators.
• The initial set of operators are referred to as source operators.
• These source operators read the input stream and send the data
downstream.
• The intermediate steps between input and output comprise various
operators that perform specific actions.
• An input stream may feed multiple output streams; the terminal
operators that produce these outputs are called sink operators.
• Each operator can run on a separate server in a cluster.
• The streaming runtime ensures that data flows from one operator to
the next, whether or not the operators run on the same server.
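The generator-based sketch below (illustrative only) shows both logical modes from above: a source feeding either a per-tuple operator or a windowed operator, each ending in a sink:

    from collections import deque

    def source():
        # Stands in for a possibly infinite input stream of tuples.
        yield from [3, 1, 4, 1, 5, 9, 2, 6]

    def per_tuple(stream):
        for t in stream:
            yield t * 10                     # process one tuple at a time

    def windowed(stream, size=3):
        window = deque(maxlen=size)          # sliding window over the stream
        for t in stream:
            window.append(t)
            if len(window) == size:
                yield sum(window) / size     # aggregate the current window

    def sink(stream):
        for result in stream:
            print(result)                    # terminal (sink) operator

    sink(per_tuple(source()))   # source -> process -> sink, tuple mode
    sink(windowed(source()))    # source -> process -> sink, window mode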
Applications of stream data
1. Financial services: credit card operations in banks
2. Stock market data
3. Fraud detection
4. Weather patterns during hailstorms and tsunamis
5. Healthcare equipment (heartbeat and blood pressure monitors) used
to save lives
6. Web server log records
7. Telecommunications: call data records (CDRs) containing detailed
usage information
8. Defense, surveillance, and cyber security
9. Transactions in retail chains and online e-commerce sites
Streaming data architecture
• A streaming data architecture is an information technology
framework that puts the focus on processing data in motion and
treats extract-transform-load (ETL) batch processing as just one more
event in a continuous stream of events.
• This type of architecture has three basic components:
1) an aggregator that gathers event streams and batch files from a
variety of data sources,
2) a broker that makes data available for consumption, and
3) an analytics engine that analyzes the data, correlates values, and
blends streams together.
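A toy end-to-end wiring of these three components (the functions are hypothetical illustrations, not a real framework):

    def aggregator(sources):
        # Gathers event streams and batch files from several data sources.
        for s in sources:
            yield from s

    def broker(events):
        # Makes events available for consumption; here it simply forwards them.
        yield from events

    def analytics_engine(events):
        # Analyzes the blended stream: tracks a running mean of the values.
        total, count = 0.0, 0
        for value in events:
            total += value
            count += 1
            print(f"value={value}, running_mean={total / count:.2f}")

    stream_a = iter([1.0, 2.0, 3.0])   # an event stream
    batch_b = iter([10.0, 20.0])       # a batch file replayed as events
    analytics_engine(broker(aggregator([stream_a, batch_b])))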
The Components of a Streaming Architecture
• A streaming architecture must include these four key building blocks:
1. The Message Broker / Stream Processor
2. Batch and Real-time ETL Tools
3. Data Analytics / Serverless Query Engine
4. Streaming Data Storage
Message broker
• Message brokers are used to send a stream of events from producers
to consumers through a push-based mechanism. The message broker
runs as a server, with producers and consumers connecting to it as
clients. Producers write events to the broker, and consumers receive
them from the broker.
1. This is the element that takes data from a source, called a producer,
translates it into a standard message format, and streams it on an
ongoing basis. Other components can then listen in and consume the
messages passed on by the broker.
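The sketch below mimics this push-based pattern with an in-memory broker (purely illustrative; real brokers such as Kafka or Kinesis run as separate servers):

    import queue
    import threading

    class Broker:
        def __init__(self):
            self.subscribers = []        # one queue per consumer

        def subscribe(self):
            q = queue.Queue()
            self.subscribers.append(q)
            return q

        def publish(self, message):
            for q in self.subscribers:   # push the event to every consumer
                q.put(message)

    broker = Broker()
    inbox = broker.subscribe()

    def consumer():
        while True:
            msg = inbox.get()            # blocks until the broker pushes an event
            if msg is None:              # sentinel marks end of stream
                break
            print("consumed:", msg)

    t = threading.Thread(target=consumer)
    t.start()
    for event in ["click", "purchase", "click"]:
        broker.publish(event)            # producer writes events to the broker
    broker.publish(None)
    t.join()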