BIGDATA UNIT-II

P.SRIDEVI
DEPT. OF IT/CSE
UNIT II Stream Processing
• Mining data streams:
• Introduction to Streams Concepts,
• Stream Data Model and Architecture,
• Stream Computing, Netflix Kinesis Data Streams, Nasdaq architecture using Amazon S3
• Sampling Data in a Stream,
• Filtering Streams,
• Counting Distinct Elements in a Stream,
• Estimating Moments,
• Counting Ones in a Window,
• Decaying Window,
• Real-time Analytics Platform (RTAP) Applications

Big data streaming is a process in which large streams of real-time data
are processed with the sole aim of extracting insights and useful trends
from them.

A continuous stream of unstructured data is sent into memory for analysis
before being stored on disk. This happens across a cluster of servers.
Speed matters the most in big data streaming. The value of data, if not
processed quickly, decreases with time.

Real-time streaming data analysis is a single-pass analysis. Analysts
cannot choose to reanalyze the data once it is streamed.
• A data stream is a high-speed continuous flow of data from diverse sources.
• Data streaming is the next wave in analytics, as it assists
organisations in quick decision-making through real-time analytics.
• The sources might include
remote sensors,
scientific processes,
stock markets,
online transactions,
tweets,
internet traffic,
video surveillance systems etc.
• Generally, these streams arrive at high speed, with a huge volume of data
generated by real-time applications.
Stream Concepts
• Stream is Data in motion
• Data is ingested in two modes
1. Batch mode (batch layer/COLD LAYER)
2. Stream mode (stream layer/ HOT LAYER)
• IBM InfoSphere Streams is an analytics tool for Big Data Analytics, used for
analysing data in real time with micro-latency (delay).
• The InfoSphere Streams design lets you leverage MPP (Massively Parallel
Processing) techniques to analyse data while it is streaming.
• It helps us know what is happening in real time, take action, and improve
the results.
Batch VS Stream
Data stream characteristics
• Data streams have unique characteristics when compared with
traditional datasets.
• They are potentially infinite, massive, continuous, temporally
ordered and fast changing.
• Storing such streams and then processing them is not viable, as that needs
a lot of storage and processing power.
• For this reason, they have to be processed in real time in order to
discover knowledge from them, instead of being stored and processed like
traditional data.
• Thus the processing of data streams poses challenges in terms of the
memory and processing power of systems.
Characteristics of Data Streams:
• Large volumes of continuous data, possibly infinite.
• Steadily changing and requires a fast, real-time response.
• Random access is expensive, so a single-scan algorithm is used to analyze the data.
• Store only the summary of the data seen so far.
• Stream data are multidimensional in creation and need multilevel and multidimensional treatment.
• Data streams are sequences of training examples that arrive
continuously at high speed from one or more sources.
• Data stream mining is a process of mining continuous incoming
real time streaming data with acceptable performance.
• Processing streaming data in order to discover knowledge is
given much importance recently as such data is made available
through rich internet applications.
• Streams are suited for both:
1. structured data (about 20%)
2. semi-structured and unstructured data (coming from sensors,
voice, text, video and other high-volume sources)
Data stream model

(Figure: data stream model — elements / tuples / transactions arriving continuously, with selected data sent to archival storage.)
Actions on streaming data
Stream processing starts by ingesting data from a publish-subscribe
service, performs an action on it and then publishes the results back
to the publish-subscribe service or another data store. These actions
can include processes such as analyzing, filtering, transforming,
combining or cleaning data.
Actions that stream processing takes on data include
1. aggregations (e.g., calculations such as sum, mean, and standard
deviation),
2. analytics (e.g., predicting a future event based on patterns in the data),
3. transformations (e.g., changing a number into a date format),
4. enrichment (e.g., combining the data point with other …).
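As a concrete illustration of the aggregation action above, here is a minimal Python sketch (not tied to any particular streaming platform) that computes a sum, mean and standard deviation over tumbling windows of a simulated stream; the window size and the sample readings are assumptions for illustration.

```python
from statistics import mean, pstdev

def tumbling_windows(stream, size):
    """Group an (endless) iterable of readings into fixed-size windows."""
    window = []
    for reading in stream:
        window.append(reading)
        if len(window) == size:
            yield window          # emit the full window downstream
            window = []           # start a new (tumbling) window

# Hypothetical stream of sensor readings; a real source would be unbounded.
readings = [3.1, 2.9, 3.4, 10.8, 3.0, 3.2]

for w in tumbling_windows(readings, size=3):
    # Aggregations per window: sum, mean and standard deviation
    print(f"window={w} sum={sum(w):.1f} mean={mean(w):.2f} std={pstdev(w):.2f}")
```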
The best data streaming and analytics tools
• Amazon Kinesis
• Google Cloud DataFlow
• Azure Stream Analytics
• IBM Streaming Analytics
• Apache Storm
• StreamSQL
• Amazon Kinesis
Kinesis flexibility helps businesses to initially start with basic reports and
insights into data but as demands grow, it can be used for deploying machine
learning algorithms for in-depth analysis.
• Google Cloud DataFlow
By implementing streaming analytics, firms can filter out data that is
ineffectual and slows down the analytics.
Utilising Apache Beam with Python, we can define data pipelines to extract,
transform, and analyse data from various IoT devices and other data sources
(a minimal Beam sketch follows this list of tools).
• Apache Storm
Open-sourced by Twitter, Apache Storm is a must-have tool for
real-time data evaluation. Unlike Hadoop, which carries out batch processing,
Apache Storm is specifically built for transforming streams of data. However,
it can also be used for online machine learning and ETL.
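The snippet below is a minimal sketch of the Apache Beam (Python SDK) pipeline style mentioned above, assuming the apache_beam package is installed. It uses an in-memory list in place of a real IoT source, and the readings and filtering threshold are hypothetical values chosen for illustration.

```python
import apache_beam as beam

# Hypothetical temperature readings standing in for an IoT source.
readings = [18.5, 21.0, 35.2, 19.8, 40.1]

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(readings)                      # extract (bounded stand-in)
        | "KeepHot" >> beam.Filter(lambda t: t > 30.0)           # filter out ineffectual data
        | "FormatAlert" >> beam.Map(lambda t: f"ALERT: {t} C")   # transform each element
        | "Print" >> beam.Map(print)                             # stand-in sink / analysis step
    )
```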
Strengths & Weaknesses of Data Streams
• Advantages of Data Streams :
• real-time insights
• Increase Customer Satisfaction.
• Helps in minimizing costs
• It provides details to react swiftly to risk
• Disadvantages of Data Streams :
• NETWORK FAILURES
• RECENT DATA NOT AVAILABLE
• Lack of security of data in the cloud
• Big Data challenges are:
• Sharing and Accessing Data:
• Privacy and Security: another of the most important challenges with Big
Data, with sensitive, conceptual, technical as well as legal significance.
• Analytical Challenges:
• Technical challenges: quality of huge data and storage.
• Scalability: Big Data projects can grow and evolve rapidly. The
scalability issue of Big Data has led towards cloud computing.
When the volume of the underlying data is very large, high-speed and
continuously flowing, it leads to a number of computational and mining
challenges, listed below.
(1) Data contained in data streams is fast changing, high-speed and real-time.
(2) Multiple or random access of data streams is expensive, and in practice
almost impossible.
(3) A huge volume of data has to be processed in limited memory. This is a
challenge.
(4) A data stream mining system must process high-speed and gigantic
data within time limitations.
(5) The data arrives multidimensional and low level, so techniques to
mine such data need to be very sophisticated.
(6) Data stream elements change rapidly over time. Thus, data from the
past may become irrelevant for mining.
Basics of stream Processing
• A stream is a graph of nodes connected by edges.
• Each node in the graph is an operator or adaptor that will process the
data in the stream.
• Nodes can have zero or more inputs and zero or more outputs
• The output of one node is connected to the input of another node or
nodes
• The edges of the graph that link the nodes together represent the
stream of data moving between the operators.
Data stream process
• Data stream processing is represented as a data flow graph consisting of data
sources, data processors, and data sinks, and it is used to deliver processed
results to external applications for analysis.
• The stream type of the input port matches the stream types of the output ports.
• Typical operators are source, functor, and sink operators.
(Figure: FileSource → Functor → Split, with outputs going to a FileSink and an ODBCAppend sink.)
• Stream is a series of connected operators
• The initial set of operators are referred as source operators
• These operators(source) read the input stream and send the data
downstream
• The intermediate steps between input and output to downstream
comprise various operators to perform specific actions
• For every input stream there can be multiple output streams.
• The final operators that write these outputs are called sink operators.
• Each of the operators can run on a separate server in a cluster.
• The stream should ensure that data flows from one operator to another,
whether the operators are running on the same server or not.
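The source → functor → split → sink flow in the figure above can be sketched with plain Python generators; this is only a stand-in for InfoSphere Streams operators, and the tuple fields and split condition are assumptions for illustration.

```python
def file_source():
    """Source operator: reads tuples from the input stream (here, an in-memory list)."""
    for tup in [("IBM", 125.3), ("MSFT", 310.2), ("IBM", 126.0)]:
        yield tup

def functor(stream):
    """Functor operator: transforms each tuple (adds a derived flag)."""
    for ticker, price in stream:
        yield (ticker, price, price > 200)   # derived field: "expensive"

def split(stream):
    """Split operator: routes tuples to one of two downstream sinks."""
    cheap, expensive = [], []
    for ticker, price, is_expensive in stream:
        (expensive if is_expensive else cheap).append((ticker, price))
    return cheap, expensive

# Sink operators: here they simply print; real sinks might be FileSink / ODBCAppend.
cheap, expensive = split(functor(file_source()))
print("FileSink   :", cheap)
print("ODBCAppend :", expensive)
```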
Stream processing
• Streams are sequences of data that flow between operators in
a stream processing application.
• A stream can be an infinite sequence of tuples, such as a stream for a
traffic flow sensor.
• Operators process an input stream in one of the following logical
modes.
1.As each tuple 2. As a window
• A window is a sequence of tuples that is a subset of a possibly infinite
stream.
• A stream processing application is a collection of operators, all of
which process streams of data. One of the most basic stream
processing applications is a collection of three operators: a source
operator, a process operator, and a sink (output) operator.
Applications of stream data
1. financial services- credit card operations in banks
2. stock market data
3. fraud detection
4. weather patterns during hailstorms, tsunamis
5. healthcare: monitoring equipment (heartbeat, BP) to save lives
6. web server log records
7. telecommunications: Call Detail Records (CDRs) containing detailed usage
information
8. Defense, surveillance, cyber security
9. Transactions in retail chains in online e-Commerce sites,
A data stream is characterized by the following:
• A data stream is potentially unbounded in size.
• The data is being generated continuously in real time.
• The data is generated sequentially as a stream. It is most often
ordered by the timestamp assigned to each of the arriving records
implicitly (by arrival time) or explicitly (by generation time).
• Typically, the volume of the data is very large, and the rates are high
and irregular.
For example, storing one day’s worth of IP network traffic flows may
require as much as 2.5TB
Data Stream Management Systems (DSMSs) queries
• Such queries might include
1.detection and analyses of extreme events in the traffic and fraud
detection (e.g. sudden increase in trading volume in a stock market
application, unusual credit card transactions),
2. intrusion detection (e.g. DDoS attacks),
3. worms and viruses in IP network monitoring ,
4. traffic outliers (e.g. generation of statistics for network provisioning
and service level agreements),
5. network application identification using application-specific
signatures (e.g. peer-to-peer file-sharing applications without a server), etc.
Processing such tasks often involves evaluation of a number of
complex aggregations, joins, matching of various regular expressions,
and generation of data synopses.
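As a toy illustration of the first kind of query (detecting a sudden increase in trading volume), here is a minimal Python sketch that flags any volume exceeding a chosen multiple of the running average; the threshold and the sample volumes are assumptions for illustration.

```python
def detect_volume_spikes(volumes, factor=3.0):
    """Yield (index, volume) whenever a volume exceeds factor * running average."""
    total, count = 0, 0
    for i, v in enumerate(volumes):
        if count > 0 and v > factor * (total / count):
            yield i, v                     # extreme event detected
        total += v
        count += 1

trading_volumes = [1000, 1200, 950, 1100, 9000, 1050]   # hypothetical per-minute volumes
for i, v in detect_volume_spikes(trading_volumes):
    print(f"possible extreme event at minute {i}: volume={v}")
```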
Event stream
processing(RealTimeProcessing)
• Event stream processing (ESP) is the practice of taking action on a
series of data points that originate from a system that continuously
creates data.
• The term “event” refers to each data point in the system, and
“stream” refers to the ongoing delivery of those events.
• A series of events can also be referred to as “streaming data” or “data
streams.”
• Actions that are taken on those events include aggregations (e.g.,
calculations such as sum, mean, standard deviation),
• analytics (e.g., predicting a future event based on patterns in the
data), and ingestion (e.g., inserting the data into a database).
• Event stream processing is necessary for situations where action needs
to be taken as soon as possible. This is why event stream processing
environments are often described as “real-time processing.”
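To illustrate the ingestion action mentioned above, here is a minimal sketch that inserts streamed events into a database using Python's built-in sqlite3 module; the table schema and event tuples are hypothetical.

```python
import sqlite3

# In-memory database standing in for a real event store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts REAL, ticker TEXT, price REAL)")

# Ingestion: insert each event from the stream as it arrives.
stream = [(1700000000.0, "IBM", 125.3), (1700000001.5, "IBM", 126.0)]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", stream)
conn.commit()

# A downstream analytic query over the ingested events.
print(conn.execute("SELECT ticker, AVG(price) FROM events GROUP BY ticker").fetchall())
```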
Stream processing
5 LOGICAL LAYERS OF STREAM ARCHITECTURE
The Components of a Streaming Architecture
• A streaming architecture must include these four key building blocks:
1. The Message Broker / Stream Processor
2. Batch and Real-time ETL Tools
3. Data Analytics / Serverless Query Engine
4. Streaming Data Storage
Apache Kafka is an open-source, distributed event streaming platform
used for high-performance data pipelines, streaming analytics, data
integration, and mission-critical applications.
Kafka Streams: designed for continuous stream processing, with support
for windowed operations and stateful processing.
Kafka Topic: stores and forwards messages without inherent stream
processing capabilities.
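For illustration, here is a minimal sketch of writing events to and reading events from a Kafka topic using the third-party kafka-python client (not Kafka Streams, which is a Java library). It assumes a broker running at localhost:9092 and a hypothetical topic named "transactions".

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: writes events to the broker (assumed to be at localhost:9092).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("transactions", value=b'{"ticker": "IBM", "price": 125.3}')
producer.flush()

# Consumer: receives events pushed from the same topic.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:          # blocks, reading the (unbounded) stream
    print(message.value)
```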
Streaming Data Architecture
• A streaming data architecture is an information technology
framework that puts the focus on processing data in motion
and treats extract-transform-load (ETL) batch processing as
just one more event in a continuous stream of events.
• This type of architecture has three basic components –
1) An aggregator that gathers event streams and batch files
from a variety of data sources,
2) A broker that makes data available for consumption and
3) An analytics engine that analyzes the data, correlates values
and blends streams together.
Message broker
• Message Brokers are used to send a stream of events from the
producer to consumers through a push-based mechanism. Message
broker runs as a server, with producer and consumer connecting to it
as clients. Producers can write events to the broker and consumers
receive them from the broker.
1.This is the element that takes data from a source,
called a producer, translates it into a standard message
format, and streams it on an ongoing basis. Other
components can then listen in and consume the
messages passed on by the broker.
Two popular stream processing tools are Apache Kafka and Amazon Kinesis Data Streams.
(Figure: producer clients write events to the message broker (server); consumer clients receive events from it.)
2. Data events from one or more message brokers must be
aggregated, transformed, and structured before data can be
analysed with SQL-based analytics tools. This would be done by
an ETL tool or platform that receives queries from users, fetches
events from message queues, then applies the query to generate a
result
The result may be an API call, an action, a visualization, an alert, or in
some cases a new data stream. Examples of open-source ETL tools for
streaming data are Apache Storm, Spark Streaming, and WSO2 Stream Processor.
3. Data Analytics / Serverless
Query Engine
• Once streaming data has been prepared for consumption by the stream processor,
it must be analyzed to provide value. There are many different
approaches to streaming data analytics. Here are some of the tools
most commonly used for streaming data analytics.

Kafka streams can be


Streaming usecase
processed and
ANALYTICS TOOL Low latency serving of
persisted to a
Cassandra streaming events to
Cassandra cluster for
apps
decision making
4. Streaming Data Storage

• Low-cost storage technologies are allowing most organizations to
store their streaming event data.
In a data lake (e.g. Amazon S3):
Pros: agile, no need to structure data into tables; low-cost storage.
Cons: high latency makes real-time analysis difficult; difficult to perform
SQL analytics.
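A minimal sketch of landing a micro-batch of streamed events in an S3-based data lake with the boto3 client follows; it assumes boto3 is configured with credentials, and the bucket name, key layout and event payload are hypothetical.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")   # credentials/region come from the environment

# A small micro-batch of events buffered from the stream (hypothetical payload).
events = [{"ticker": "IBM", "price": 125.3}, {"ticker": "MSFT", "price": 310.2}]

# One object per micro-batch, keyed by arrival time (no table structure needed).
key = f"events/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.json"
s3.put_object(
    Bucket="my-streaming-data-lake",      # hypothetical bucket name
    Key=key,
    Body="\n".join(json.dumps(e) for e in events).encode("utf-8"),
)
```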
FILTERING AND SAMPLING IN DATA STREAMS
Large volumes of data are being continuously generated in real time,
taking the form of an ordered, unbounded sequence of items, or a data
stream.
Such applications include:
• financial data analyses,
• sensor networks ,
• IP network monitoring ,
• phone records log processing and others.
These applications often require sophisticated processing capabilities
for continuous monitoring of the input stream in order to detect
changes or find interesting patterns over the data in a timely manner.
sampling
• Input tuples enter at a rapid rate, at one or more
input ports.
• The system cannot store the entire stream accessibly.
• How do you make critical calculations about the
stream using a limited amount of (secondary)
memory?
• Since we can’t store the entire stream, one obvious
approach is to store a sample
SAMPLING TECHNIQUES
FIXED SIZE SAMPLING
BIASED RESERVOIR SAMPLING
CONCISE SAMPLING
• Scenario: search engine query stream
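A minimal sketch of fixed-size (reservoir) sampling in Python follows: it keeps a uniform random sample of k items from a stream whose length is unknown in advance. The query stream used here is hypothetical.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)        # pick a slot in [0, i]
            if j < k:
                reservoir[j] = item         # replace with probability k/(i+1)
    return reservoir

# Hypothetical search-engine query stream
queries = (f"query-{i}" for i in range(10_000))
print(reservoir_sample(queries, k=5))
```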
Another solution
Filter operator
• Filtering condition of a stream item is independent of other items of
the same stream or any other data stream. The most common
example of such filtering is stream sampling, when each item is
filtered out with a certain probability and the remaining items form
the desired sample.
• For example, suppose the stream processing application needs to filter the
stock transaction data for IBM® transaction records.
• The Filter operator is used to extract relevant information from
potentially large volumes of data. As shown, the input for
the Filter operator is all the transactions;
• the output is only the IBM transactions.
• In general, the Filter operator receives tuples from an input stream
and submits a tuple to the output stream only if the tuple satisfies the
criteria that are specified by the filter parameter.
The Filter operator performs the following steps:
1. Receives a tuple from the input stream (AllTransactions).
2. If the value of the ticker attribute is IBM, submits the tuple to the
output stream (IBMTransactions).
3. Repeats Steps 1 to 2 until all the tuples from the input stream are
processed.
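These steps can be sketched in a few lines of Python as a stand-in for the Filter operator (this is not the actual InfoSphere Streams code); the tuple layout and sample transactions are assumptions for illustration.

```python
def filter_operator(all_transactions, ticker="IBM"):
    """Submit a tuple to the output stream only if it satisfies the filter condition."""
    for txn in all_transactions:          # step 1: receive a tuple from AllTransactions
        if txn["ticker"] == ticker:       # step 2: check the ticker attribute
            yield txn                     # submit to IBMTransactions
        # steps 1-2 repeat until the input stream is exhausted

all_transactions = [
    {"ticker": "IBM",  "price": 125.3},
    {"ticker": "AAPL", "price": 189.9},
    {"ticker": "IBM",  "price": 126.0},
]
ibm_transactions = list(filter_operator(all_transactions))
print(ibm_transactions)
```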
• Stateful stream processing is a type of computing that processes a
continuous stream of data in real-time, while also retaining a current
state or context based on the data that has been processed so far.
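A minimal Python sketch of stateful processing follows: it retains a running count and total per ticker as tuples arrive, which is the kind of context a stateful stream processor keeps. The trade data and state layout are assumptions for illustration.

```python
from collections import defaultdict

def stateful_running_average(stream):
    """Maintain per-ticker state (count, total) while processing a stream of trades."""
    state = defaultdict(lambda: {"count": 0, "total": 0.0})   # retained context
    for ticker, price in stream:
        s = state[ticker]
        s["count"] += 1
        s["total"] += price
        yield ticker, s["total"] / s["count"]                 # running average so far

trades = [("IBM", 125.3), ("MSFT", 310.2), ("IBM", 126.1)]
for ticker, avg in stateful_running_average(trades):
    print(f"{ticker}: running average = {avg:.2f}")
```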
