
BIG DATA UNIT-II

P.SRIDEVI
DEPT. OF IT
UNIT II Stream Processing
• Mining data streams:
• Introduction to Streams Concepts,
• Stream Data Model and Architecture,
• Stream Computing, Netflix Kinesis Data Streams, Nasdaq Architecture using Amazon S3
• Sampling Data in a Stream,
• Filtering Streams,
• Counting Distinct Elements in a Stream,
• Estimating Moments,
• Counting Ones in a Window,
• Decaying Window,
• Real time Analytics Platform (RTAP) Applications,

Big data streaming is a process in which large streams of real-time data
are processed with the sole aim of extracting insights and useful trends
from them.

A continuous stream of unstructured data is sent into memory for
analysis before being stored on disk. This happens across a cluster of
servers.
Speed matters most in big data streaming: the value of data decreases
with time if it is not processed quickly.

Real-time streaming data analysis is a single-pass analysis. Analysts
cannot choose to reanalyze the data once it has streamed past.
• A data stream is a high-speed, continuous flow of data from diverse sources.
• Data streaming is the next wave in the analytics and machine learning
landscape, as it assists organisations in quick decision-making through real-time
analytics.
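Because the analysis is single-pass, a stream computation typically keeps only a
constant-size summary in memory. A minimal Python sketch (the sensor readings
are hypothetical):

def running_stats(stream):
    # Single-pass (one-look) analysis: each element is seen exactly once,
    # and only a constant-size summary (count, total) is kept in memory.
    count, total = 0, 0.0
    for value in stream:             # one pass; elements cannot be revisited
        count += 1
        total += value
        yield count, total / count   # running mean of everything seen so far

readings = iter([21.5, 22.0, 21.8, 23.1])  # hypothetical sensor readings
for n, mean in running_stats(readings):
    print(f"after {n} readings, mean = {mean:.2f}")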

• The sources might include
remote sensors,
scientific processes,
stock markets,
online transactions,
tweets,
internet traffic,
video surveillance systems, etc.
• Generally, these streams arrive at high speed, with a huge volume of data
generated by real-time applications.
Stream Concepts
• Stream is Data in motion
• Data is ingested in two modes
1. Batch mode (batch layer)
2. Stream mode (stream layer)
• IBM InfoSphere Streams is an analytics tool for Big Data Analytics, for
analysing data in real time with micro-latency.
• The InfoSphere Streams design lets you leverage MPP (Massively Parallel
Processing) techniques to analyse data while it is streaming. It helps
us know what is happening in real time, take action, and improve the
results.
Batch vs. Stream
Data streams have unique characteristics
• Data streams have unique characteristics when compared with
traditional datasets.
• They are potentially infinite, massive, continuous, temporally
ordered and fast-changing.
• Storing such streams and then processing them is not viable, as that
needs a lot of storage and processing power.
• For this reason they must be processed in real time to discover
knowledge from them, instead of being stored and processed as in
traditional data mining.
• Thus the processing of data streams poses challenges in terms of the
memory and processing power of systems.
Characteristics of Data Streams:
• Large volumes of continuous data, possibly infinite.
• Steadily changing, requiring a fast, real-time response.
• The data stream model captures nicely our data processing needs of today.
• Random access is expensive; only single-scan (one-look) algorithms are feasible.
• Store only a summary of the data seen so far (see the sampling sketch below).
• Stream data are mostly low-level and multidimensional in their creation, and
need multi-level, multi-dimensional treatment.
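One standard way to "store only a summary of the data seen so far" is
reservoir sampling, which keeps a fixed-size uniform sample of an unbounded
stream; a minimal sketch:

import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of size k from a stream of unknown length,
    # using O(k) memory regardless of how long the stream runs.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = random.randint(0, i)   # new item replaces a slot with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))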
• Data streams can be conceived as sequences of training examples that
arrive continuously at high speed from one or more sources.
• Data stream mining is the process of mining continuously incoming real-time
streaming data with acceptable performance.
• Processing streaming data to discover knowledge has been given much
importance recently, as such data is made available through rich internet
applications.
• Streams are suited for both
• 1. Structured data (about 20%)
• 2. Semi-structured and unstructured data (coming from sensors, voice,
text, video and other high-volume sources)
Data stream model

[Figure: data stream model — elements, tuples and transactions flow
continuously into the stream processor, with archival storage to one side.]
The best data streaming and analytics tools
• Amazon Kinesis
• Google Cloud DataFlow
• Azure Stream Analytics
• IBM Streaming Analytics
• Apache Storm
• StreamSQL
• Amazon Kinesis
Kinesis' flexibility lets businesses start with basic reports and insights into
their data; as demands grow, it can be used to deploy machine learning
algorithms for in-depth analysis.
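For illustration, this is roughly how a producer writes records to Kinesis with
the boto3 client (the stream name and region are assumptions; the stream must
already exist):

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # hypothetical region

event = {"sensor_id": "s-42", "temperature": 22.7}
kinesis.put_record(
    StreamName="example-stream",             # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["sensor_id"],         # determines which shard gets the record
)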
• Google Cloud DataFlow
By implementing streaming analytics, firms can filter out data that is
ineffectual and slows down the analytics.
Utilising Apache Beam with Python, we can define data pipelines to extract,
transform, and analyse data from various IoT devices and other data sources.
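A minimal Apache Beam pipeline in Python of the kind described above (the
element values and the temperature filter are illustrative); the same code can
run on the Dataflow runner:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create([{"device": "d1", "temp": 35.2},
                                {"device": "d2", "temp": 18.4}])
     | "FilterHot" >> beam.Filter(lambda r: r["temp"] > 30.0)  # drop ineffectual data
     | "Format" >> beam.Map(lambda r: f"{r['device']}: {r['temp']}")
     | "Print" >> beam.Map(print))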
• Apache Storm
Open-sourced by Twitter, Apache Storm is a must-have tool for
real-time data evaluation.
Unlike Hadoop, which carries out batch processing, Apache Storm is specifically
built for transforming streams of data. However, it can also be used for online
machine learning.
Challenges of streaming data
• There are two challenges in developing new techniques that can
handle streaming data.
• The first challenge is to design fast mining methods for handling
streaming data.
• The second challenge is detecting data distribution and changing
concepts in a highly dynamic environment.
When the underlying data is very large in volume, high-speed and
continuously flowing, it leads to a number of computational and mining
challenges, listed below.
(1) Data contained in data streams is fast-changing, high-speed and real-time.
(2) Multiple or random accesses of data streams are expensive, in fact
almost impossible.
(3) Huge volumes of data must be processed in limited memory; this is a
challenge.
(4) A data stream mining system must process high-speed, gigantic
data within time limitations.
(5) The data arrives multidimensional and low-level, so techniques to
mine such data need to be very sophisticated.
(6) Data stream elements change rapidly over time. Thus, data from the
past may become irrelevant for mining.
Basics of Stream Processing
• A stream is a graph of nodes connected by edges.
• Each node in the graph is an operator or adaptor that processes the
data in the stream.
• Nodes can have zero or more inputs and zero or more outputs.
• The output of one node is connected to the input of another node or
nodes.
• The edges of the graph that link the nodes together represent the
stream of data moving between the operators.
Data stream process

[Figure: example stream graph — FileSource → Functor → Split → FileSink
and ODBCAppend.]
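The same kind of graph can be sketched with plain Python generators, each
standing in for one operator (the data and routing predicate are hypothetical):

def file_source(lines):            # source operator (plays the role of FileSource)
    for line in lines:
        yield line.strip()

def functor(stream):               # transform operator (plays the role of Functor)
    for line in stream:
        yield line.upper()

def split(stream, predicate):      # split operator: route tuples to one of two branches
    for item in stream:
        yield (0 if predicate(item) else 1), item

data = ["alert: disk full", "info: ok", "alert: overheating"]
for branch, item in split(functor(file_source(data)),
                          lambda s: s.startswith("ALERT")):
    # sink operators: each branch goes to a different destination
    print("FileSink" if branch == 0 else "ODBCAppend", "<-", item)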
Stream processing
• Streams are sequences of data that flow between operators in
a stream processing application.
• A stream can be an infinite sequence of tuples, such as a stream from a
traffic flow sensor.
• Operators process an input stream in one of the following logical
modes:
1. As each tuple arrives 2. As a window
• A window is a sequence of tuples that is a subset of a possibly infinite
stream (see the sketch below).
• A stream processing application is a collection of operators, all of
which process streams of data. One of the most basic stream
processing applications is a collection of three operators: a source
operator, a process operator, and a sink (output) operator.
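A count-based sliding window can be kept with a bounded deque; a minimal
sketch (the window size and values are arbitrary):

from collections import deque

def sliding_windows(stream, size):
    window = deque(maxlen=size)    # oldest tuples fall out automatically
    for tup in stream:
        window.append(tup)
        if len(window) == size:
            yield tuple(window)    # the current window over the (possibly infinite) stream

for w in sliding_windows(iter([1, 4, 2, 8, 5]), size=3):
    print(w, "sum =", sum(w))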
• A stream is a series of connected operators.
• The initial set of operators are referred to as source operators.
• These (source) operators read the input stream and send the data
downstream.
• The intermediate steps between input and output comprise various
operators that perform specific actions.
• For every input stream there can be multiple output streams.
• The operators that produce these outputs are called sink operators.
• Each of the operators can run on a separate server in a cluster.
• The stream should ensure that data flows from one operator to the other
whether the operators are running on the same server or not.
Applications of stream data
1. Financial services: credit card operations in banks
2. Stock market data
3. Fraud detection
4. Weather patterns during hailstorms, tsunamis
5. Healthcare equipment, such as heartbeat and BP monitors, to save lives
6. Web server log records
7. Telecommunications: Call Data Records (CDRs), where detailed usage
information is captured
8. Defense, surveillance, cyber security
9. Transactions in retail chains and online e-commerce sites
Streaming data architecture
• A streaming data architecture is an information technology
framework that puts the focus on processing data in motion and
treats extract-transform-load (ETL) batch processing as just one more
event in a continuous stream of events.
• This type of architecture has three basic components:
1) an aggregator that gathers event streams and batch files from a
variety of data sources,
2) a broker that makes data available for consumption, and
3) an analytics engine that analyzes the data, correlates values and
blends streams together.
[Figure: stream processing architecture diagram.]
The Components of a Streaming Architecture
• A streaming architecture must include these four key building blocks:
1. The Message Broker / Stream Processor
2. Batch and Real-time ETL Tools
3. Data Analytics / Serverless Query Engine
4. Streaming Data Storage
Message broker
• Message brokers are used to send a stream of events from
producers to consumers through a push-based mechanism. The message
broker runs as a server, with producers and consumers connecting to it
as clients. Producers write events to the broker and consumers
receive them from the broker.
1. This is the element that takes data from a source, called a producer,
translates it into a standard message format, and streams it on an
ongoing basis. Other components can then listen in and consume the
messages passed on by the broker.

[Figure: producers (clients) write events to the message broker (server);
consumer clients receive events. Two popular stream processing tools are
Apache Kafka and Amazon Kinesis Data Streams.]

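A minimal producer/consumer pair using the kafka-python client (the broker
address and topic name are assumptions):

from kafka import KafkaProducer, KafkaConsumer

# Producer: writes events to the broker (assumes a broker at localhost:9092).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user": "u1", "action": "click"}')  # hypothetical topic
producer.flush()

# Consumer: receives events pushed from the broker for that topic.
consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)   # loops indefinitely as new events arrive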

2. Data events from one or more message brokers must be aggregated, transformed,
and structured before the data can be analysed with SQL-based analytics tools. This
is done by an ETL tool or platform that receives queries from users, fetches
events from message queues, then applies the query to generate a result.
The result may be an API call, an action, a visualization, an alert, or in
some cases a new data stream. Examples of open-source ETL tools for
streaming data are Apache Storm, Spark Streaming, and WSO2 Stream Processor.
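As an example of streaming ETL, a PySpark Structured Streaming sketch that
reads events, applies a query, and continuously emits results (the socket
source is an assumption, convenient for testing with `nc -lk 9999`):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

# Source: lines arriving over a socket stand in for a message queue.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Transform: a running word count, i.e. the query applied to incoming events.
words = lines.selectExpr("explode(split(value, ' ')) as word")
counts = words.groupBy("word").count()

# Sink: continuously print the updated result (could instead be an alert or API call).
(counts.writeStream.outputMode("complete")
 .format("console").start().awaitTermination())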
3. Data Analytics / Serverless Query Engine
• Streaming data is prepared for consumption by the stream processor, but it
must be analyzed to provide value. There are many different
approaches to streaming data analytics. Below is one of the tools
most commonly used for streaming data analytics.

Analytics tool: Cassandra
Streaming use case: low-latency serving of streaming events to apps for
decision making. Kafka streams can be processed and persisted to a
Cassandra cluster.
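A sketch of persisting consumed events to Cassandra with the DataStax Python
driver (the keyspace, table, and schema are hypothetical and must exist
beforehand):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # assumes a local Cassandra node
session = cluster.connect("events_ks")    # hypothetical keyspace

# Insert one consumed stream event into a (hypothetical) clicks table.
session.execute(
    "INSERT INTO clicks (user_id, ts, action) VALUES (%s, toTimestamp(now()), %s)",
    ("u1", "click"),
)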
4. Streaming Data Storage

• Low-cost storage technologies are allowing most organizations to
store their streaming event data.

In a data lake (e.g., Amazon S3):
Pros: agile, no need to structure data into tables; low-cost storage.
Cons: high latency makes real-time analysis difficult; difficult to perform
SQL analytics.
• Unstructured data: JSON or XML <key, value> records.
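A sketch of landing raw JSON events in an S3 data lake with boto3 (the bucket
name and key layout are hypothetical):

import json
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")

event = {"user": "u1", "action": "click"}             # an unstructured <key, value> event
ts = datetime.now(timezone.utc)
key = f"events/dt={ts:%Y-%m-%d}/{ts:%H%M%S%f}.json"   # partition by date for later queries

# No table or schema needed up front: just append the raw JSON object.
s3.put_object(Bucket="example-data-lake",             # hypothetical bucket
              Key=key,
              Body=json.dumps(event).encode("utf-8"))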
